IIIT Hyderabad at TAC 2009

Vasudeva Varma, Vijay Bharat, Sudheer Kovelamudi, Praveen Bysani, Santosh GSK, Kiran Kumar N, Kranthi Reddy, Karuna Kumar, Nitin Maganti
Search and Information Extraction Lab, Language Technologies Research Center, IIIT Hyderabad, India.
Report No: IIIT/TR/2009/222, Centre for Search and Information Extraction Lab, International Institute of Information Technology, Hyderabad - 500 032, India, November 2009.
In Proceedings of the Text Analysis Conference (TAC 2009), National Institute of Standards and Technology, Gaithersburg, Maryland, USA.

Abstract

In this paper, we report our participation in the Update Summarization, Knowledge Base Population and Recognizing Textual Entailment tracks at TAC 2009. This year, we enhanced our basic summarization system with support vector regression to better estimate the combined effect of different features in ranking. A novelty measure is devised to effectively capture the relevance and novelty of a term. For Knowledge Base Population, we analyzed IR approaches and Naive Bayes classification with phrase and token searches. Finally, for RTE, we built templates using WordNet and predicted entailments.

Part I
Update Summarization Track

1 Introduction

Update summarization is a new stride in the summarization community. Ever since its introduction at DUC 2007 (http://www-nlpir.nist.gov/projects/duc/duc2007/tasks.html#pilot), there has been a consistently growing focus in this direction. The task is to summarize a cluster of documents under the assumption that the user has some prior knowledge of the topic. The major challenge in update summarization is to detect information that is not only relevant to the user's need but also novel given the user's prior knowledge. Update summarization is relevant for newswire, since a topic in news stories evolves over time and the user/reader is mostly interested in new information about that topic.

NIST first introduced update summarization as a pilot task at DUC 2007, later as a main task at TAC 2008, and continued it in TAC 2009. While the problem definition remained the same, the quality of the data has improved throughout the years. In DUC 2007, the update task data was just a subset of the Multi-Document Summarization data. In 2008, sufficient care was taken to ensure that there are distinctive events between clusters, but a huge time gap is observed between clusters. For TAC 2009, a lot of time and effort has been put into choosing news topics and creating appropriate document clusters.

Update summarization shares similarity with the Novelty track introduced at TREC 2002 (http://trec.nist.gov/data/novelty.html). The Novelty track was designed to investigate systems' ability to locate relevant, novel information within the ranked set of documents retrieved in answer to a topic. Update summarization is in a way an extension of the Novelty track, as it needs to summarize the content along with detecting relevant and novel information.

Researchers have approached the problem of update summarization at varying levels of complexity during the past couple of years at TAC. Ruifang He et al. [6] proposed an iterative feedback based evolutionary manifold ranking of sentences for update summarization. Ziheng Lin et al. [20] followed a time-stamped graph approach, incorporating information about the temporal ordering of events in articles to focus on the update summary. There are also simple content filtering approaches [19] which identify dynamic content and generate summaries.
Recent advances in machine learning like perceptrons [4], Markov models [13], CRFs [17] and Bayesian classifiers [9] have been adapted to summarization throughout the years. We use a machine learning method, Support Vector Regression (SVR), for sentence ranking. Sujian Li and You Ouyang [10] were the first to use regression in the context of text summarization to predict sentence scores. FastSum [16] also utilizes support vectors to score sentences using multiple features.

In this work, we introduce a new feature, Novelty Factor (NF), that is devised to capture the relevance and novelty of a sentence within a topic. Computationally NF is a very simple feature, yet it is very effective. Official TAC results support our argument on NF: we secured 1st position in average modified pyramid scores for cluster B, and 3rd in ROUGE-2 and ROUGE-SU4 recall scores for both clusters A and B. Our post-TAC experiments have shown significant improvement over the current results for cluster A.

2 System Description

We built a sentence extractive summarizer that extracts, scores and ranks sentences before finally generating a summary. During ranking, instead of manually weighting each sentence scoring feature (Section 3), we utilize a machine learning algorithm, Support Vector Regression (SVR), to predict the sentence rank. In the following sections we briefly explain SVR, the estimation of sentence importance and our algorithm for generating summaries.

2.1 Support Vector Regression

Regression analysis refers to techniques for modeling values of a dependent variable from one or more independent variables. Support Vector Machines, a popular mechanism for classification, can also be used for regression (Vapnik; Gunn, 1998) [5]. Consider the problem of approximating the set of training data

T = {(F_1, i_1), (F_2, i_2), ..., (F_s, i_s)} ⊂ F × R,

where F is the space of feature vectors. A tuple (F_s, i_s) represents the feature vector F_s and importance score i_s of sentence s. Each sample satisfies a linear function q(f) = ⟨w, f⟩ + b, with w ∈ F, b ∈ R. The optimal regression function is given by the minimum of the functional

Φ(w, ξ) = \frac{1}{2} \|w\|^2 + C \sum_i (\xi_i^- + \xi_i^+),

where C is a pre-specified value, and ξ_i^-, ξ_i^+ are slack variables representing upper and lower constraints on the outputs of the system. We use a radial basis kernel function for our experiments.

2.1.1 Sentence Importance Estimation

The importance score (i_s) is not pre-defined for sentences in the training data; we estimate it using the human-written summaries (also known as models) for that topic. ROUGE-2 and ROUGE-SU4 scores highly correlate with human evaluation [11]. Hence we make a safe assumption that the importance of a sentence is directly proportional to its overlap with the model summaries. Sentence importance is estimated as the ROUGE-2 score of that sentence. The importance of a sentence s, denoted by i_s, is computed as

i_s = \frac{\sum_{m \in models} |Bigram_m \cap Bigram_s|}{|s|}        (1)

where |Bigram_m \cap Bigram_s| is the number of bigrams shared by model m and sentence s. This count is normalized by the sentence length |s|.
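To make Equation (1) concrete, the following is a minimal sketch (not the exact implementation used in the system) of how a sentence's importance could be estimated from its bigram overlap with the model summaries; the whitespace tokenizer and lowercasing are illustrative assumptions.

```python
from typing import List, Set, Tuple

def bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent word pairs in a token sequence."""
    return set(zip(tokens, tokens[1:]))

def sentence_importance(sentence: str, model_summaries: List[str]) -> float:
    """Estimate importance (Eq. 1): bigrams shared with each model summary,
    summed over models and normalized by sentence length in tokens."""
    sent_tokens = sentence.lower().split()          # illustrative tokenizer
    if not sent_tokens:
        return 0.0
    sent_bigrams = bigrams(sent_tokens)
    shared = sum(len(sent_bigrams & bigrams(m.lower().split()))
                 for m in model_summaries)
    return shared / len(sent_tokens)

# Example: a sentence scored against two toy model summaries.
models = ["the company acquired the startup in 2008",
          "the startup was acquired for an undisclosed sum"]
print(sentence_importance("The company acquired the startup last year", models))
```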
2.2 Algorithm

Our system follows a three-stage algorithm to generate summaries.

1. Pre-Processing: Documents are cleaned of news headers and HTML tags. Stop words are removed, and the Porter stemmer is used to derive root words by eliminating suffixes and prefixes. Sentences are extracted from each document.

2. Feature Combination: Features used for sentence scoring are combined to rank sentences. Normally, features are manually weighted to compute the sentence rank. This process is automated with the use of SVR in three steps:
• Sentence tuple generation: Feature values of every sentence are extracted and its importance (i_s) is estimated as described in Section 2.1.1. Each sentence s in the training data is converted into a tuple of the form (F_s, i_s), where F_s is the vector of feature values of the sentence, F_s = {f_1, f_2, f_3}.
• Model building: A training model is built using SVR from the generated sentence tuples.
• Sentence scoring: The importance of a sentence in the testing dataset is predicted based on the trained model. The estimated importance value is taken as the rank of the sentence: i_s = q(F_s).

3. Summary Generation: A subset of the ranked sentences is selected to generate the summary. A redundancy check is done between a sentence and the summary generated so far before adding it to the summary; this step helps prevent duplicate sentences. Sentences are arranged based on their order of occurrence in the documents to improve readability. Reported speech is removed from the summary to improve its conciseness.

3 Features

For our previous participations at TAC 2008 and DUC 2007, complex language modeling techniques like Probabilistic Hyperspace Analogue to Language (PHAL) [7] and Kullback-Leibler divergence (KL) [1] were used as features in our system. This year we have used Sentence Position (SL1 and SL2) and Novelty Factor (NF), which is specifically devised for the update task, as sentence scoring features.

3.1 Sentence Position

Sentence position is a very old and popular feature used in summarization [3]. It is well studied and still used as a feature in most state-of-the-art summarization systems [8]. We use the location information of a sentence in two separate ways to score a sentence.

Sentence Location 1 (SL1): The first three sentences of a document generally contain the most informative content of that document, which is confirmed by our analysis of the oracle summaries (Section 5.1): nearly 40% of all the sentences of the oracle summaries come from among the first three sentences of each document. The score of a sentence s at position n in document d is given by

Score(s_{nd}) = 1 - \frac{n}{1000}   if n <= 3
Score(s_{nd}) = \frac{n}{1000}       otherwise

(assuming that the number of sentences in a document is less than 1000), such that Score(s_{1d}) > Score(s_{2d}) > Score(s_{3d}) >> Score(s_{nd}).

Sentence Location 2 (SL2): The positional index of a sentence in the document is assigned as the value of the feature. The training model will learn the optimum sentence position for the dataset based on its genre, so this feature is not inclined towards the top or bottom few sentences in a document like SL1.

Score(s_{nd}) = n

where s_n is the nth sentence in document d.
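As an illustration of how the positional features above could be fed into the SVR ranking of Section 2.2, here is a minimal sketch using scikit-learn's SVR with an RBF kernel; scikit-learn, the toy training values, and the placeholder NF score (Section 3.2) are assumptions for illustration, not the system's actual configuration.

```python
from sklearn.svm import SVR

def sl1(n: int) -> float:
    """Sentence Location 1: favor the first three sentences of a document."""
    return 1 - n / 1000.0 if n <= 3 else n / 1000.0

def sl2(n: int) -> float:
    """Sentence Location 2: raw positional index; the SVR learns its effect."""
    return float(n)

def features(position: int, nf_score: float) -> list:
    """Feature vector F_s, here {SL1, SL2, NF} for one sentence."""
    return [sl1(position), sl2(position), nf_score]

# Toy training tuples (F_s, i_s); importance i_s would come from Eq. (1).
X_train = [features(1, 0.40), features(2, 0.35), features(7, 0.10), features(12, 0.05)]
y_train = [0.31, 0.22, 0.04, 0.01]

model = SVR(kernel="rbf", C=1.0)     # radial basis kernel, as in Section 2.1
model.fit(X_train, y_train)

# Rank unseen sentences by predicted importance.
X_test = [features(1, 0.55), features(9, 0.20)]
print(model.predict(X_test))
```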
3.2 Novelty Factor (NF)

We propose a new feature, Novelty Factor (NF), that primarily targets the update summarization problem. Consider a stream of articles published on a topic over a time period T. All the articles published from time 0 to time t are assumed to have been read previously (previous clusters). Articles published in the interval t to T are unread articles that might contain new information (new cluster). Let the publishing date of a document d be represented by t_d. The NF of a word w is calculated by

nf(w) = \frac{|nd_t|}{|pd_t| + |D|}

nd_t = {d : w ∈ d ∧ t_d > t}
pd_t = {d : w ∈ d ∧ t_d <= t}
D = {d : t_d > t}

The numerator |nd_t| is the number of documents in the new cluster that contain the word w. It is directly proportional to the relevancy of the term, since all the documents in the cluster are relevant to the topic. The term |pd_t| in the denominator penalizes any word that occurs frequently in previous clusters; in other words, it elevates the novelty of a term. |D| is the total number of documents in the current cluster, which is useful for smoothing values when w does not occur in the previous clusters.

The update task data in TAC 2009 consists of two clusters: cluster A is the only previous cluster and B is the new cluster. Hence the NF of a word in cluster A is

nf_{clusA}(w) = \frac{d_A}{D_A}

and in cluster B

nf_{clusB}(w) = \frac{d_B}{d_A + D_B}

where d_X is the number of documents in cluster X that contain w and D_X is the total number of documents in cluster X. The score of a sentence s is the average nf value of its content words. We exclude query words while computing the score since they are equally important in both clusters.

Score(s) = \frac{\sum_{w_i \in s} nf(w_i)}{|s|}

The NF score of a sentence is a measure of its relevance and novelty to the topic.

4 Evaluation and Results

TAC 2008 update task documents and the corresponding models are used to generate training data for SVR. It provides 48 topics; each topic contains 20 documents divided in chronological order between cluster A and cluster B. The summary for cluster A is a normal multi-document summary of 100 words, whereas the summary for cluster B is an update summary of 100 words. The TAC 2009 update summarization data has 45 topics, with documents distributed in each topic the same way as in TAC 2008.

NIST evaluates all peer summaries manually for overall responsiveness, readability and linguistic quality. All summaries were also automatically evaluated using ROUGE and BE. Evaluations are also conducted using pyramids [15], which are built from the Summary Content Units (SCUs) of the corresponding model summaries of a topic.

We submitted two runs, Run1 and Run2, for the TAC 2009 update summarization track. Run1 (System id: 35) uses two features, NF and SL1, for sentence ranking; its training model for SVR is built upon TAC 2008 update task data. Run2 (System id: 51) generates summaries using two features, NF and SL2; DUC 2007 update task data is used to build its training model.

4.1 Results

Official TAC evaluation results of Run1 and Run2 on various intrinsic and extrinsic evaluation criteria for cluster A and cluster B are presented in Table 1 and Table 2 respectively. Run1 has proved to work exceptionally well in the update scenario: it is ranked 1st in average modified pyramid score, 4th in overall responsiveness and 3rd in ROUGE scores (R-2 and R-SU4) for cluster B, which is essentially the update cluster. It worked equally well for cluster A, securing 3rd in ROUGE and 4th in overall responsiveness; its average modified pyramid score for cluster A is slightly below the best systems. Run2 is an experimental run to find the effect of training on the overall performance of the system. The main difference between Run1 and Run2 is the quality of the training data. Even though Run2 performs better than most of the participating systems, it does not make it to the top, unlike Run1. Hence it is inferred that as the quality of training improves, so does the performance of the system.

                                  Run1     Run2
R-2                               0.10840  0.10491
R-su4                             0.14475  0.14167
Average modified pyramid score    0.299    0.295
Overall responsiveness            4.864    4.727
Table 1: TAC official results for cluster A

                                  Run1     Run2
R-2                               0.10100  0.09572
R-su4                             0.13833  0.13644
Average modified pyramid score    0.307    0.299
Overall responsiveness            4.614    4.568
Table 2: TAC official results for cluster B

5 Post-TAC Experiments

5.1 Oracle Summaries

We generated sentence-extractive oracle summaries using the test document set and their model summaries. Each oracle summary is the best sentence-extractive summary that can be generated by any sentence-extractive summarization system for that topic. Sentences are ranked using Equation 1 to produce these summaries. The motivation behind generating these summaries is to find the scope of improvement in sentence extractive summarization.
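A minimal sketch of how such an oracle summary could be assembled under a 100-word budget is shown below; the greedy selection and word-overlap redundancy check are simplifications of the procedure in Section 2.2, and the scoring function is passed in (e.g. the Eq. (1) importance from the earlier sketch).

```python
def oracle_summary(sentences, score_fn, word_budget=100, max_overlap=0.5):
    """Greedy sentence-extractive summary: rank sentences by score_fn and add
    them until the word budget is reached, skipping near-duplicate sentences."""
    ranked = sorted(sentences, key=score_fn, reverse=True)
    summary, used_words = [], 0
    for sent in ranked:
        tokens = sent.lower().split()
        if not tokens or used_words + len(tokens) > word_budget:
            continue
        # Simple redundancy check against words already in the summary.
        summary_words = {w for s in summary for w in s.lower().split()}
        overlap = len(set(tokens) & summary_words) / len(tokens)
        if overlap > max_overlap:
            continue
        summary.append(sent)
        used_words += len(tokens)
    return " ".join(summary)

# Toy usage: sentence length stands in for a real importance score.
sentences = ["Aid agencies pledged new funds on Monday.",
             "New funds were pledged by aid agencies.",
             "The weather was mild."]
print(oracle_summary(sentences, score_fn=len, word_budget=12))
```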
After the TAC submission, some new and simple features were devised to improve the summary quality, especially in the update scenario.

5.2 Query Focus

All the features (NF, SL1 and SL2) in the current summarizer are query independent and make no use of any information provided by the query. Even without query focus these features perform very well. Hence, a new feature, Qterms, is introduced to incorporate query focus into the current summarization system.

Qterms: In query-focused summarization systems, a sentence is considered relevant only if it contains a query term. We use the same intuition to score query-focused sentences. A sentence s is scored by the number of query terms it contains:

Score(s) = \frac{\sum_{w \in s} F_Q(w)}{|s|}

F_Q(w) = n   if w ∈ Q
F_Q(w) = 0   otherwise

where Q is the list of all the words in the query and n is the frequency of w in Q.

5.3 Novel Word Count (Nwords)

The Novelty track at TREC is in many ways similar to update summarization. Its task is to mark novel sentences within a set of relevant sentences for a topic. As state-of-the-art summarizers are sentence extractive and update summarization requires extracting novel sentences to build an update summary, novelty track approaches can complement update summarization in identifying sentences with new information. A simple approach to detect novel sentences is to count the new words in a sentence. Words that never occurred before in the document cluster are considered new. In this case, we consider any word that occurred in cluster A, other than query words, as old and all the remaining words as new. A sentence s is scored by the number of novel (new) words it contains:

Score(s) = \frac{\sum_{w \in s} F_{clusA}(w)}{|s|}

F_{clusA}(w) = 0     if w ∈ clusA
F_{clusA}(w) = n/N   otherwise

where clusA is the set of words in cluster A, n is the number of times w occurred in cluster B and N is the total term frequency of cluster B.

We present in Table 3 and Table 4 the effect of these new features, along with the oracle summaries, to depict the scope of improvement in sentence extractive summarization.

              Run1+Qterms  Oracle
R-2           0.11350      0.15620
R-su4         0.14969      0.18093
Table 3: ROUGE-2 and ROUGE-SU4 scores of post-TAC experiments for cluster A

              Run1+Qterms  Run1+Nwords  Run1+Nwords+Qterms  Oracle
R-2           0.09106      0.09807      0.08923             0.14978
R-su4         0.13132      0.14058      0.13047             0.17767
Table 4: ROUGE-2 and ROUGE-SU4 scores of post-TAC experiments for cluster B

Query focus (Qterms) has helped in producing better summaries for cluster A: around a 4% improvement in ROUGE scores is observed. It is interesting to see that the same feature failed to improve the quality of cluster B summaries. Inclusion of the new novelty feature (Nwords) does not show any significant gain over Run1, and adding Qterms to this combination only resulted in a further dip in scores.

6 Discussion

Novelty Factor (NF) performs well in summarization because all the documents given under a topic are relevant to it; hence, the importance of a term is directly proportional to the number of documents in which it occurs. In the context of update summarization, there is a need to penalize the terms that are authoritative in cluster A. So, the importance of a term in cluster B is inversely proportional to the number of documents in which it occurred in cluster A.
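To make the preceding point concrete, below is a minimal sketch of the cluster-level NF computation from Section 3.2; the representation of documents as sets of lowercased tokens is an assumption for illustration, not the system's actual implementation.

```python
def novelty_factor(word: str, cluster_a_docs, cluster_b_docs) -> float:
    """NF for a word in cluster B: document frequency in the new cluster B,
    penalized by its document frequency in the already-read cluster A and
    smoothed by the size of cluster B (Section 3.2)."""
    d_b = sum(word in doc for doc in cluster_b_docs)   # new-cluster doc freq
    d_a = sum(word in doc for doc in cluster_a_docs)   # previous-cluster doc freq
    return d_b / (d_a + len(cluster_b_docs))

def nf_sentence_score(sentence_words, query_words, cluster_a_docs, cluster_b_docs) -> float:
    """Average NF over content words, excluding query words."""
    content = [w for w in sentence_words if w not in query_words]
    if not content:
        return 0.0
    return sum(novelty_factor(w, cluster_a_docs, cluster_b_docs)
               for w in content) / len(content)

# Toy example: each document is a set of lowercased tokens.
cluster_a = [{"earthquake", "rescue", "teams"}, {"earthquake", "damage"}]
cluster_b = [{"earthquake", "aid", "pledged"}, {"aid", "arrives"}]
print(nf_sentence_score(["aid", "pledged"], {"earthquake"}, cluster_a, cluster_b))
```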
Both sentence positional algorithms (SL1 and SL2) have performed decently. SL1 is inclined towards the top sentences in a document, while SL2 is unbiased towards the positional index of a sentence. SL1 works because of the intuition that top sentences will always have informative content. SL2 helps boost informative sentences based on the genre of the corpus: it learns significant sentence positions from the training data. As both training and testing belong to the same genre (newswire), SL2 performs well. In our experiments, SL1 and SL2 perform similarly because the important sentences in the training data are present among the top three sentences of a document.

The major difference between Run1 and Run2 is the amount and quality of the training data. The DUC 2007 update task data is just a subset of the Multi-Document Summarization data, whereas the TAC 2008 data is specifically handcrafted for the update task. It is observed that Run1 produces better summaries than Run2 in both automatic and manual evaluations. There is still a huge gap between the ROUGE scores of the oracle summaries and the machine-generated summaries, which reveals that there is a lot of scope for improvement in sentence extractive summarization.

In future we plan to incorporate NF within a formal language modeling framework. We are currently working on predicting word-level importance using SVR, and experiments are being carried out to predict the role of word position in a sentence in deciding its importance. Further, we plan to explore the use of entailment in the summarization scenario.

Part II
Knowledge Base Population (KBP)

1 Introduction

The TAC 2009 Knowledge Base Population (KBP) track focuses on automatically updating structured Knowledge Bases (KB). The task has been broken down into two subtasks.

1) Entity Linking: The aim of this task is to determine, for each query, which knowledge base entry is being referred to, or whether the entry is not present in the knowledge base. The query consists of a name-string and a document id from the document collection. Each query string will occur in the associated document in the test collection; the purpose of the associated document is to provide context that might be useful for disambiguating the name-string. Each query must be processed independently of the others, and for each query a knowledge base node id should be returned if present, else NIL.

2) Slot Filling: The Slot Filling task involves learning a pre-defined set of relationships and attributes for target entities based on the documents in the test collection. A query in the Slot Filling task will contain a name-string, docid, entity-type, node-id, and a list of slots to ignore. The node id that is provided refers to a node representing the entity in the KB; for targets for which no node exists in the KB, the node id will begin with NIL. As in the entity linking task, the provided docid is intended to give context for the entity. Systems must process the target entities (i.e., each query) independently of one another. For each slot value returned, systems must also return a single docid from the test collection that supports the value returned for the given entity and slot. Slots can be single-valued or multi-valued.
2 Approach For Entity Linking

We have broken down the Entity Linking task into three separate modules.

1. Preprocessing: Our aim during the preprocessing step is to build a knowledge repository of entities that contains a vast amount of world knowledge about entities, such as name variations, acronyms, confusable names, spelling variations and nick names. We use Wikipedia, a huge collection of articles, each identified by a unique title, that define and describe events and entities, to build our knowledge repository. The advantages of using Wikipedia are:
• It has better coverage of named entities [18].
• Redirect pages provide us with synonyms [18].
• Disambiguation pages can be used for homonym resolution [14].

We use the following link structure from Wikipedia to extract the name variations of an entity.

(a) Redirect Links: A redirect page in Wikipedia is an aid to navigation. When a page in Wikipedia is redirected, it means that those pages refer to the same entity. They often indicate synonymous terms, but can also be abbreviations, more scientific or more common terms, frequent misspellings or alternative spellings.
(b) Disambiguation Pages: Disambiguation pages are specifically created for ambiguous entities and consist of links to articles defining the different meanings of the entity. This is especially useful for extracting the abbreviations of an entity and its other possible names.
(c) Bold Text from First Paragraph: In Wikipedia the first paragraph usually contains a summary of the article or its most important information, and thus contains the most relevant words for that article. We extract the phrases in the first paragraph of a Wikipedia article that are written in bold font. This bold text generally refers to nick names, abbreviations, full names, etc.
(d) Metaphones: In order to identify spelling variations for a given entity, we use the metaphone algorithm [2]. We generate the metaphone codes for each token obtained using the above three features.
(e) Lucene: Lucene (a high-performance, full-featured text search engine; http://lucene.apache.org/java/docs/) is used as the underlying retrieval system to retrieve entity mappings from the knowledge repository created using the above features. Metaphone codes for each token of the knowledge repository are also stored in this index. We also index the knowledge base and Wikipedia to facilitate fast retrieval of data.

2. Candidate List Generation: Since the entity linking task involves determining whether or not a knowledge base node exists for a given query, we generate the possible candidate maps for a given query from the knowledge base as well as from Wikipedia. The motive behind also adding articles from Wikipedia to the candidate items list is that our knowledge base is a subset of Wikipedia articles; this helps in identifying NIL-valued queries. That is, if for a given query we have nodes from both the knowledge base and Wikipedia in our candidate items list and our algorithm maps to the Wikipedia article, we can conclude that no node describing this entity is present in the knowledge base. The candidate items list is generated using only the title information of the articles and the given query terms, and we follow different approaches for queries that are acronyms and those that are not. To identify whether a given query is an acronym, we use a simple heuristic: if all the letters present in the query are capitals, we consider it an acronym; otherwise we do not. The algorithm for handling these two cases is as follows.
(a) Not an Acronym: If the given query is not an acronym, we search for the query terms directly in the title field of the knowledge base nodes. If a hit is found, we add that node to the candidate items list. If no hit is found, the query could be a variation of the entity, or a variation of the entity could be present in the knowledge base. We then search the knowledge repository index we built in order to get possible variations of the entity. If we find any possible variations in our knowledge repository index, we search the title field of the knowledge base index using these variations and add any hits to the candidate items list. We also search the Wikipedia index using the retrieved variations and add any hits to the candidate items list. If no variations are found in the knowledge repository index either, we assume that the entity might have been written in a form different from the usual one: we generate the metaphone code for each token present in the query and search the metaphone field of the knowledge base index, adding any hits to the candidate items list. The algorithm is depicted in Fig. 1.

Fig. 1: Flow chart when the query is not an acronym

(b) Acronym: If the given query is an acronym, we try to get the expanded form from the document content given as disambiguation text. We remove stop words from the disambiguation text and use an N-gram based approach to find the expanded form: if the length of the acronym is N characters, we check whether N continuous tokens in the disambiguation text have the same initials as the characters of the acronym. If we are successful in finding the expanded form from the disambiguation text, we search the knowledge base index using these tokens and add any hits to the candidate items list. If we do not find the expanded form in the disambiguation text, we search the knowledge repository index for the acronym; if we find the expanded form there, we search the knowledge base index using this expanded form and add any hits to the candidate items list, and we also search the Wikipedia index using the expanded form and add any hits to the candidate items list. Finally, we also search for the acronym itself in the title field of the knowledge base and Wikipedia indexes, adding any hits to the candidate items list. The algorithm is depicted in Fig. 2.

Fig. 2: Flow chart when the query is an acronym

Refining the candidate items list: We delete articles that belong to Wikipedia from our candidate items list if they are also present in the knowledge base, because if a node describing an entity is present in the knowledge base, it could be the possible link for our query.

While searching the article titles of the knowledge base or Wikipedia index for generating the candidate items list, we can apply two different methods.

(a) Phrase Search: In this method we check whether the exact phrase of the query, or the expanded form of the acronym, is present in the title field; only on finding the exact phrase do we add those nodes to the candidate items list. A variation of this is searching for the phrase with a noise of one word. E.g., if the given query is "UT" and we find the expanded form "University of Texas" in our knowledge repository, a simple phrase search retrieves nodes that have the exact phrase "University of Texas" in the title, whereas a phrase search with noise also retrieves nodes with titles such as "University of Texas at Austin" and "University of Texas at Dallas".

(b) Token Search: In this method we check whether all the tokens of the query, or of the expanded form of the acronym, are present in the title field. A variation of token search is searching with a noise of one word. The difference between phrase search and token search is that in phrase search the word order is constrained, whereas in token search only the presence of each token matters, not the order in which the tokens occur. E.g., if the given query is "CCP" and we find the expanded form "Chinese Communist Party" in our knowledge repository, a token search retrieves nodes whose titles are either "Chinese Communist Party" or "Communist Party of China"; the node titled "Communist Party of China" is not found as a candidate item in the phrase search.
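The following is a minimal sketch of the phrase search vs. token search distinction over a plain list of article titles; the real system performs these lookups against Lucene indexes, and since the exact semantics of "noise of one word" are not fully specified in the text, the `noise` parameter (number of query tokens allowed to go unmatched) is only one plausible interpretation.

```python
def contains_phrase(title_tokens, query_tokens):
    """True if query_tokens occur in title_tokens as a contiguous, ordered run."""
    m = len(query_tokens)
    return any(title_tokens[i:i + m] == query_tokens
               for i in range(len(title_tokens) - m + 1))

def phrase_candidates(query, titles):
    """Phrase search: word order is constrained."""
    q = query.lower().split()
    return [t for t in titles if contains_phrase(t.lower().split(), q)]

def token_candidates(query, titles, noise=0):
    """Token search: only token presence matters, not order; up to `noise`
    query tokens may be missing from the title."""
    q = set(query.lower().split())
    hits = []
    for t in titles:
        missing = q - set(t.lower().split())
        if len(missing) <= noise:
            hits.append(t)
    return hits

titles = ["Chinese Communist Party", "Communist Party of China"]
print(phrase_candidates("Chinese Communist Party", titles))          # order-constrained match only
print(token_candidates("Chinese Communist Party", titles, noise=1))  # also the reordered variant
```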
3. Calculating Similarity Score: If there are no nodes present in the candidate list, it means that no node describing that entity is present in the knowledge base, and we return NIL. If there is only one candidate item and it belongs to the knowledge base, we have found an exact match for the query and there are no other entities in our knowledge base describing this entity. But if that single node belongs to Wikipedia, the likely link for our query is not present in the knowledge base and hence we return NIL. If there is more than one node in our candidate items list and all of them belong to Wikipedia, we return NIL, as there are no nodes in the knowledge base describing our entity. But if the candidate items list contains knowledge base nodes as well, we have to find which node is the most likely link for the given query. This can be solved by two different methods: a classification approach and an information retrieval model.

Classification Approach:
(a) Rainbow Text Classifier (http://www.cs.cmu.edu/~mccallum/bow/rainbow/): It has several built-in methods for classification, such as Naive Bayes [12], SVM, KL-divergence, TF-IDF and K-Nearest Neighbors. We conducted our experiments using Naive Bayes and K-Nearest Neighbors. If we consider all the possible candidate items as different classes, we need to find which class is the best map for our query. In this approach we use a bag of words as the feature and build classification models using the Rainbow classifier: each candidate item is viewed as a separate class to train the model. We then give the disambiguation text provided along with the query as the test document. This test document is classified into one of the classes, and the score obtained is the likelihood of the test document belonging to that class.
(b) Information Retrieval Model: We index the candidate items' text using Lucene; each candidate item is treated as a separate document. Query formulation is an important part of the success of this approach: we try to reduce as much noise as possible while generating the query from the disambiguation text, while boosting the terms most likely to describe our entity. Since the disambiguation text is tagged neatly into different paragraphs, we consider only those paragraphs in which the query terms are present. Once we have extracted all the paragraphs that contain the query terms, we remove the stop words and form a boolean "OR" query of all the tokens generated from the paragraph text. We give this query to the candidate items index, and the relevance score for each document is calculated.

Result generation: Once we find the closest matching node for a given query using the disambiguation text, we check whether the node belongs to the knowledge base or to Wikipedia. If the node belongs to the knowledge base, we give the knowledge base node id as the mapping link; if the node belongs to Wikipedia, we return NIL, indicating that we do not have a link for the query entity in our knowledge base.

3 Description of Runs

We submitted three runs for the entity linking task:
1) Phrase search with noise for candidate list generation and Naive Bayes for classification.
2) Token search with noise for candidate list generation and Naive Bayes for classification.
3) Phrase search with noise for candidate list generation and the information retrieval approach.

Table 5 shows the results of our various runs, including post-TAC experiments. In fact, an IR approach with token search performs better than the runs we submitted for TAC 2009. The micro-average score obtained by our system outperforms all 35 runs submitted at TAC 2009. The average median score over all 35 runs is 71.08%, and the baseline score is 57.10% (when NIL is returned for all queries). Our system outperforms the median score by as much as 11% and the baseline score by 25%.

Algorithm  Noise  Phrase/Token search  Micro-average score  NIL-valued precision  Non-NIL-valued precision  Macro-average score
IR         1      Token search         82.25                86.32                 76.84                     75.70
IR         1      Phrase search        82.17                86.41                 76.54                     75.39
IR         0      Phrase search        81.81                86.90                 75.04                     75.46
IR         0      Token search         81.76                86.45                 75.52                     75.54
NB         1      Token search         81.43                85.42                 76.12                     75.38
NB         1      Phrase search        81.25                85.37                 75.76                     75.10
NB         0      Phrase search        81.12                85.91                 74.75                     75.45
NB         0      Token search         80.87                85.51                 74.69                     74.96
Table 5: Results of various experiments (IR = Information Retrieval, NB = Naive Bayes)

4 Approach for Slot Filling

In the slot filling task we have to populate the slot-value pairs of an entity in the knowledge base. The query for the slot filling task is of the form (name-string, doc-id, entity-type, node-id, ignore-slots). Name-string refers to the entity whose slot-value pairs have to be modified, and entity-type refers to whether the given entity is a person, location or organization. Node-id refers to the knowledge base entry that has to be populated; if the node-id is NIL, we have to populate the whole template of the given entity type.

1. Preprocessing: During the preprocessing stage we index all the documents of the document collection using Lucene. This facilitates searching and fast retrieval of the documents.
2. Approach: Given a query, we search for the name-string on the index built during preprocessing, boosting documents that have the name-string in the headline. From the retrieved results we consider the top 50 documents for further processing, on the assumption that they might contain information related to the slots. Once we have these documents, we need to extract the sentences that might contain the slot values, so we index the sentences of the top 50 documents as separate documents.
3. Query Formulation: We have been given a generalized mapping from different Wikipedia slot names to a single slot name of the knowledge base (e.g., "Nick names" and "Also known as" are both mapped to "Alias name"), and we use this information during query formulation. We form a boolean "OR" query of all the tokens of the mappings provided. Based on the slot value to be filled, we query the index built from the sentences of the top 50 documents and consider the top 10 retrieved sentences for further processing, assuming that these top 10 sentences might contain our slot values.
We have categorized the slot values as (a) Person, (b) Organisation, (c) Place, (d) Date, (e) Integer, (f) String. We then need to extract the slot value from the 10 sentences; for this we have used the Stanford Named Entity Recognizer (NER). We tag the 10 sentences with the NER and, based on the slot value type, extract the value from the sentences. Different sentences might return different values; we consider the value with the highest frequency as the correct slot value and return it as the output. For list-valued slots we return the top 3 most frequent results.

4. Results: Table 6 shows the results of our single run submitted at TAC 2009 for the slot filling task.

Type                       Our score  Median score  Best score
Single-valued slots        0.761      0.514         0.816
List-valued slots          0.604      0.439         0.742
SF-Value score, all slots  0.682      0.461         0.779
Table 6: Results of our slot filling run

5 Conclusion and Future Work

We have presented a general overview of our system for TAC 2009, which analyzes a query and its disambiguation text in order to find a link in a knowledge base. We showed that an indexing based approach with word search noise was able to outperform all the other approaches. Our approach and experiments indicate that the two-level methodology helps in achieving high accuracies. We believe this approach is particularly promising because Wikipedia is constantly growing and being updated; with its continuous growth we are guaranteed the latest, up-to-date information.

Part III
Recognizing Textual Entailment (RTE)

1 Introduction

Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text (the Hypothesis) is entailed by the other text (the Text). The definition of entailment is based on (and assumes) common human understanding of language as well as common background knowledge. We define textual entailment as a directional relationship between a pair of text fragments, which we call the Text (T) and the Hypothesis (H). We say that T entails H, denoted by T -> H, if a human reading T would infer that H is most likely true. For example, given assumed common background knowledge of the business news domain and the following text:

T1: Internet Media Company Yahoo Inc. announced Monday it is buying Overture Services Inc. in a 1.63-billion dollar (U.S.) cash-and-stock deal that will bolster its on-line search capabilities.

The following hypotheses are entailed:
- H1.1 Yahoo bought Overture
- H1.2 Overture was acquired by Yahoo
- H1.3 Overture was bought
- H1.4 Yahoo is an internet company

If H is not entailed by T, there are two possibilities: 1) H contradicts T; 2) the information in H cannot be judged as TRUE on the basis of the information contained in T.
For example, the following hypotheses are contradicted by T1 above:
- H1.5 Overture bought Yahoo
- H1.6 Yahoo sold Overture

While the following ones cannot be judged on the basis of T1 above:
- H1.7 Yahoo manufactures cars
- H1.8 Overture shareholders will receive 4.75 dollars cash and 0.6108 Yahoo stock for each of their shares.

However, entailment must be judged considering the content of T and common knowledge together, and never on the basis of common knowledge alone.

2 Approach

This time, in RTE-5, the average length of the text is higher than in RTE-4. Texts come from a variety of sources and are not edited from their source documents. Thus, systems are asked to handle real text that may include typographical errors and ungrammatical sentences. The sentence structures of the text and the hypothesis are rarely similar; hence, there is a need to dramatically reduce the complexity of the required input sentences.

The major idea of our approach is to find linguistic structures, here termed templates, that share the same anchors. Anchors are lexical elements describing the context of a sentence. Templates that are extracted from different sentences (text and hypothesis) and connect the same anchors in these sentences are assumed to entail each other. For example, the sentences 'Yahoo bought Overture' and 'Yahoo acquired Overture' share the anchors {X = Yahoo, Y = Overture}, suggesting that the templates 'X buy Y' and 'X acquire Y' entail each other. In the later subsections, a) the problem of finding matching anchors and b) identifying the template structure are addressed.

Apart from this main idea, the dataset is initially preprocessed to extract the dependency trees using the Stanford parser, and similarity values between common nouns obtained from the WordNet similarity tool are also precomputed, as WordNet would take a few seconds to load for every T-H pair, which would be costly for a quick run over the test set. Additionally, we added a few rules and used semantic resources such as Stanford NER, VerbOcean (for comparing verbs), MontyLingua (to get the base form of a verb) and an acronyms dataset. Fig. 3 best describes our system.

Fig. 3: System architecture

3 Pre-processing

The corpus is first treated with the Stanford parser to generate dependency trees; we use the collapsed version of the dependency trees. To recognize the named entities, the Stanford Named Entity Recognition tool has been used.

4 Template Extraction

A template is a dependency parse-tree fragment with variable slots at some tree nodes, e.g.,

X <--subj-- prevent --obj--> Y

An entailment relation between two templates T1 and T2 holds if the meaning of T2 can be inferred from the meaning of T1. For example, T1: 'Aspirin reduces heart attack' can be inferred from T2: 'Aspirin prevents a first heart attack' using the above template structure. The anchor set here is {Aspirin, heart attack}.

4.1 Anchor Set Extraction

In a sentence, the verb action is related to the entities of the subject and object. If we know the subject and object entities (or anchor sets), then we can tell which event the sentence is probably describing. The goal of this phase is to find a substantial number of promising anchor sets for each sentence. Usually these are the noun phrases of a sentence. A noun phrase can be a combination of common nouns and proper nouns; once the proper nouns are identified, the event or fact of the sentence can be judged. The Anchor Set Extraction (ASE) algorithm is applied to the output of the Stanford parser.
Our algorithm captures all noun phrases of the sentence as anchor sets, with additional relevance given to the named entities together as a separate anchor set. The named entities are identified by applying the Stanford NER tool to the dataset.

4.2 Template Formation

The Template Extraction (TE) algorithm accepts as its input the list of anchor sets extracted by the ASE. TE then generates a set of syntactic templates from the dependency tree structure. This is applied only to the hypothesis, and the extracted templates are searched for in the text. For example:

T: Madonna has three children.
Output of ASE: {Madonna, three children}

Using the dependency tree, the connecting element for these anchor sets is extracted in order to obtain a relation between each of the anchor sets. After applying the TE algorithm:

Output template(s): has=VBZ subject-Madonna=NNP object-three=CD children=NNS

The POS tags are attached at the end because they help the later modules distinguish proper nouns, common nouns, cardinal numbers and verbs. Similarly, if there are more anchor sets, there are more templates, each describing the dependency between the anchor sets.

5 Template Comparison

After the templates are extracted from the hypothesis, they are searched for in the text. For the text sentences, the ASE algorithm is applied and all the anchor sets are extracted. Then, for the hypothesis anchor sets, the Anchor Set Match (ASM) algorithm is applied, where each anchor set is matched against the anchor sets of the text sentences. Considerable care is taken while matching the anchor sets. For named entities (like {China, Chinese}, {Malay, Malaysia}) special rules are designed to cover the majority of such cases. For nouns, the WordNet similarity tool is used; as the values are calculated in the preprocessing stage, there is not much time lapse during comparison. The threshold value was initially set to 0.75. Rules for cardinal numbers are mentioned below (Section 6). The output of the ASM algorithm is the best matching anchor set in the text for every anchor set in the hypothesis. If any anchor set in the hypothesis does not find a matching anchor set, then the decision is UNKNOWN.

With the anchor sets matched, the text and hypothesis are believed to describe roughly the same event or fact. Their verbs or modifier heads then have to match to decide whether they describe exactly the same event. The best matching anchor set from the text is retrieved. This text anchor set modifies the parent nodes in its dependency tree, and all such nodes are extracted from the tree. The Comparison Algorithm is applied to these nodes to compare them with the modifier verb or noun of the corresponding hypothesis anchor set. The comparison algorithm first looks for a direct match of the verb or noun in the list of nodes. If there is none, it uses the WordNet similarity values for noun comparisons. Because WordNet does not have a sufficient database for verbs, VerbOcean is used, which has a huge collection of verbs and different combinational values. But VerbOcean only contains the base forms of verbs, so in order to convert verbs to their base form we use the MontyLingua tool. This algorithm returns whether or not the event of the anchor sets matches.
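A minimal sketch of the anchor-set matching step described above is given below, assuming precomputed noun-similarity scores (e.g., from the WordNet similarity tool) and the 0.75 threshold; the helper names and the dictionary-based similarity lookup are illustrative assumptions, not the system's actual interface.

```python
NOUN_SIM_THRESHOLD = 0.75   # threshold used for WordNet-based noun similarity

def anchors_match(hyp_anchor, text_anchor, noun_sim):
    """Return True if every token of the hypothesis anchor can be matched to
    some token of the text anchor, either exactly or via a precomputed
    similarity score at or above the threshold."""
    for h_tok in hyp_anchor:
        matched = any(
            h_tok == t_tok or noun_sim.get((h_tok, t_tok), 0.0) >= NOUN_SIM_THRESHOLD
            for t_tok in text_anchor
        )
        if not matched:
            return False
    return True

def best_matching_anchor(hyp_anchor, text_anchors, noun_sim):
    """ASM step: pick the first text anchor set that matches; None means UNKNOWN."""
    for text_anchor in text_anchors:
        if anchors_match(hyp_anchor, text_anchor, noun_sim):
            return text_anchor
    return None

# Toy usage: a precomputed similarity table stands in for the WordNet tool.
sims = {("children", "kids"): 0.82}
hyp = ["three", "children"]
text_anchors = [["madonna"], ["three", "kids"]]
print(best_matching_anchor(hyp, text_anchors, sims))   # -> ['three', 'kids']
```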
6 Rules for Entailment Prediction

• When the event of an anchor set is identified, it is verified against the events of the other anchor sets. If all other anchor sets satisfy the same event, then the verdict is ENTAILMENT.
• Acronyms are expanded using the acronym database, so acronyms are also matched against their expanded forms and entailment is predicted accordingly. Ex: UK = United Kingdom.
• Rules for cardinal number matching (numeric named entities):
  – The basic rule first resolves numeric named entities into numeric values. Ex: 24 = 24; twenty-six = 26; 1L = 100000; 50,000 = 50000; 45 thousand = 45000.
  – The next basic rule is to directly compare the numeric values of the hypothesis and the text. There may also be quantification modifiers before or after the numeric entities. Ex: 'more than 100' in the hypothesis should match 'at least 110' in the text.
  – The modifier words are considered first, and then the difference of the numeric values, to decide whether they match.

7 Rules for Contradiction Prediction

• The basic rule considers negation words such as 'not', 'n't', 'never', etc.
• In the VerbOcean database, a list of antonyms for each verb is widely available, so antonyms can be detected using VerbOcean and a contradiction can be judged.
• In cases where the events of the anchor sets match but the numerical values mismatch, the pair is predicted as CONTRADICTION.
• Consider the following example:

killed nsubj=John obj=Lucy
killed nsubj=Lucy obj=John

Here, the anchor sets have matched and the event is also the same, but the roles are reversed. Such cases are declared as CONTRADICTION.

8 Rules for Unknown Prediction

• Whenever an anchor set from the hypothesis does not match any anchor set in the text, the pair is judged as UNKNOWN.
• When the numbers along with their associated noun phrases from the hypothesis cannot be mapped to any associated noun phrase in the text, the pair is judged as UNKNOWN.
• When a T-H pair does not satisfy any of the above rules for entailment or contradiction, the system assumes it to be an UNKNOWN case.
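A minimal sketch of the numeric-value resolution used by the cardinal number rules above is shown below; the small lookup table and the handling of the 'more than' / 'at least' modifiers are simplified assumptions for illustration, not the system's full rule set.

```python
import re

# Tiny illustrative lookup; a real resolver would cover far more number words.
WORD_VALUES = {"twenty-six": 26, "thousand": 1000, "lakh": 100000}

def resolve_number(phrase: str):
    """Resolve a numeric named entity such as '50,000' or '45 thousand' to an int."""
    phrase = phrase.lower().strip()
    if phrase in WORD_VALUES:
        return WORD_VALUES[phrase]
    m = re.match(r"([\d,\.]+)\s*(\w+)?", phrase)
    if not m:
        return None
    value = float(m.group(1).replace(",", ""))
    if m.group(2) in WORD_VALUES:              # e.g. '45 thousand'
        value *= WORD_VALUES[m.group(2)]
    return int(value)

def numbers_compatible(hyp: str, text: str) -> bool:
    """Check a hypothesis constraint like 'more than 100' against a text value
    like 'at least 110', looking at the modifier before the numeric value."""
    h_val = resolve_number(hyp.split()[-1])
    t_val = resolve_number(text.split()[-1])
    if h_val is None or t_val is None:
        return False
    if hyp.startswith(("more than", "at least")):
        return t_val >= h_val
    if hyp.startswith(("less than", "at most")):
        return t_val <= h_val
    return h_val == t_val

print(resolve_number("45 thousand"))                        # 45000
print(numbers_compatible("more than 100", "at least 110"))  # True
```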
9 Results and Conclusion

In TAC 2009, the RTE-5 task requires predicting the relation between 600 pairs of Text and Hypothesis. We submitted three runs for the RTE 3-way task. In the first run, for the comparison of nouns using the WordNet tool, the threshold value is set to 0.75, which maintains a balance between generality and specificity. In the case of VerbOcean there are two considerations: verb synonyms, carrying the tag 'similar', and verb antonyms, carrying the tag 'opposite-of'. For verb synonyms the threshold value is set to 10.5 (the maximum value found was approximately 25); for verb antonyms the threshold value is set to 12. For the other two runs, experiments were carried out by varying the above-mentioned threshold values. The results obtained for the three runs submitted for the RTE 3-way task are presented in Table 7.

       Accuracy
Run1   46.8
Run2   46.8
Run3   46.83
Table 7: TAC 2009 results for the three runs

We obtained an accuracy of 60.66 for the RTE 2-way task (evaluation was done against the gold standard set) with the threshold values set as mentioned above. The confusion matrix for the 3-way task of run 3, with accuracy 46.83, is displayed in Table 8.

                          Gold: Entailment  Gold: Unknown  Gold: Contradiction
Predicted: Entailment     148               48             36
Predicted: Unknown        111               122            43
Predicted: Contradiction  41                40             11
Total                     300               210            90
Table 8: Confusion matrix for the 3-way task, run 3

We obtained an accuracy of 46.83 as our best result. This year our approach concentrated more on extracting templates and comparing the anchor sets, which we think covered more cases than the approach we followed previously.

10 Results and Analysis of Ablation Tests

In the system we built, when two words are required to be analyzed for comparison, we considered the following cases for the categorization of those words: 1) both the words are nouns; 2) at least one of them is a verb; 3) at least one of them is an acronym. We used different tools for each of these cases. In the first case, when the similarity of nouns has to be calculated, we used the WordNet similarity tool. In the second case, for the comparison of a verb with any other word, we used VerbOcean along with the MontyLingua tool. For the third case we used the acronym data set available at the acronym-guide web page, which was suggested in the RTE knowledge resources.

10.1 Experimental Setting

For all the ablation tests the experimental settings were the same, and they were run upon the experimental settings of run 3 of the 3-way task. The threshold value for the similarity of nouns using the WordNet tool is set to 0.75, and the threshold value was 10.5 for comparing the synonym verbs and the antonym verbs using VerbOcean.

10.2 Ablation Tests

10.2.1 Test-1. Tool ablated: WordNet
In this ablation test, the WordNet similarity tool is ablated. If this tool is removed, then two nouns are compared using only string comparison techniques: given two words, if they are similar up to a certain string length (starting from the first letter), proportionate to their lengths, then they are said to be similar. Results obtained for this test: 2-way accuracy = 0.6033, change in 2-way accuracy = 0.01; 3-way accuracy = 0.47, change in 3-way accuracy = 0.0017.

10.2.2 Test-2. Tool ablated: MontyLingua
When a verb needs to be compared with another word, we use VerbOcean. But VerbOcean stores all verbs in their base form, so there is a need to convert the verb into its base form; for that purpose we used the MontyLingua tool. When this tool is removed, the verb is checked in VerbOcean in its original form. So, in this ablation test, we removed the tool and tested with the same experimental setting. We obtained zero change in the accuracies of the 2-way and 3-way tasks: the accuracy for the 2-way task was 0.6067 and the accuracy for the 3-way task was 0.4683.

10.2.3 Test-3. Tool ablated: VerbOcean
Here the verb comparison tool VerbOcean is removed. When this is done, those words are compared based on the string comparison techniques mentioned in the WordNet ablation test. Here too, we obtained zero change in the accuracies of the 2-way and 3-way tasks; their accuracies remained the same.

10.2.4 Test-4. Tool ablated: Acronyms set
The acronyms data set downloaded from the acronym-guide is removed in this ablation test. Whenever two words are compared, it is checked whether either of those words is an acronym; if so, its expanded form is retrieved from the acronym set, if available, and then they are compared. If a word is embedded in the expanded set of words, it is said to be similar. So, if this data set is removed, the acronyms are compared as such, after removing the punctuation. There was no change in the accuracies of the 2-way or 3-way task in this ablation test.

References

[1] L. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.
[2] S. Deorowicz and M. Ciura. Correcting spelling errors by modelling their causes. International Journal of Applied Mathematics and Computer Science, 15(2):275, 2005.
[3] H. Edmundson. New methods in automatic extracting. Journal of the ACM, 16:268–285, 1969.
[4] S. Fisher and B. Roark. Query-focused summarization by supervised sentence ranking and skewed word distributions. In Proceedings of DUC 2006, 2006.
[5] S. R. Gunn. Support vector machines for classification and regression. Technical report, May 1998.
[6] R. He, Y. Liu, B. Qin, T. Liu, and S. Li. HITIR's update summary at TAC 2008: Extractive content selection for language independence. In TAC 2008 Proceedings, Text Analysis Conference, December 2008.
[7] J. Jagarlamudi, P. Pingali, and V. Varma. Query independent sentence scoring approach to DUC 2006. In Proceedings of DUC 2006, 2006.
[8] R. Katragadda, P. Pingali, and V. Varma. Sentence position revisited: A robust light-weight update summarization baseline algorithm. In Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3), pages 46–52, Boulder, Colorado, June 2009. Association for Computational Linguistics.
[9] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of ACM SIGIR '95, pages 68–73. ACM, 1995.
[10] S. Li, Y. Ouyang, W. Wang, and B. Sun. Multi-document summarization using support vector regression. In DUC 2007 Notebook, Document Understanding Conference, November 2007.
[11] C.-Y. Lin. Looking for a few good metrics: Automatic summarization evaluation - how many samples are enough? In Proceedings of the NTCIR Workshop 4, June 2004.
[12] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996.
[13] J. M. Conroy, J. D. Schlesinger, J. Goldstein, and D. P. O'Leary. Left-brain/right-brain multi-document summarization. In Proceedings of DUC 2004, 2004.
[14] R. Mihalcea. Using Wikipedia for automatic word sense disambiguation. In Proceedings of NAACL HLT 2007, 2007.
[15] A. Nenkova, R. Passonneau, and K. McKeown. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4(2):4, 2007.
[16] F. Schilder and R. Kondadadi. FastSum: Fast and accurate query-based multi-document summarization. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, 2008.
[17] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proceedings of IJCAI '07, pages 2862–2867, 2007.
[18] T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and accessing Wikipedia as a lexical semantic resource. Data Structures for Linguistic Resources and Applications, pages 197–205, 2007.
[19] J. Zhang, X. Cheng, H. Xu, X. Wang, and Y. Zeng. Summarizing dynamic information with signature terms based content filtering. In TAC 2008 Proceedings, Text Analysis Conference, December 2008.
[20] Z. Lin, H. H. Hoang, L. Qiu, S. Ye, and M.-Y. Kan. NUS at TAC 2008: Augmenting timestamped graphs with event information and selectively expanding opinion contents. In TAC 2008 Proceedings, Text Analysis Conference, December 2008.