Web-Based Lemmatisation of Named Entities

Richárd Farkas (1), Veronika Vincze (2), István Nagy (2), Róbert Ormándi (1), György Szarvas (1), and Attila Almási (2)

(1) MTA-SZTE, Research Group on Artificial Intelligence, 6720 Szeged, Aradi Vértanúk tere 1., Hungary
(2) University of Szeged, Department of Informatics, 6720 Szeged, Árpád tér 2., Hungary
{rfarkas,vinczev,ormandi,szarvas}@inf.u-szeged.hu, [email protected], [email protected]

P. Sojka et al. (Eds.): TSD 2008, LNAI 5246, pp. 53–60, 2008. © Springer-Verlag Berlin Heidelberg 2008

Abstract. Identifying the lemma of a Named Entity is important for many Natural Language Processing applications like Information Retrieval. Here we introduce a novel approach for Named Entity lemmatisation which utilises the occurrence frequencies of each possible lemma. We constructed four corpora in English and Hungarian and trained machine learning methods using them to obtain simple decision rules based on the web frequencies of the lemmas. In experiments our web-based heuristic achieved an average accuracy of nearly 91%.

Keywords: Lemmatisation, web-based techniques, named entity recognition.

1 Introduction

This paper seeks to lemmatise Named Entities (NEs) in English and in Hungarian. Finding the lemma (and the inflectional affixes) of a proper noun can be useful for several reasons. First, the proper name can be stored in a normalised form (e.g. for indexing), and it may prove easier to classify a proper name into the corresponding NE category using its lemma instead of the affixed form. Second, the inflectional affixes themselves can contain useful information for some specific tasks and can thus be used as features for classification (e.g. the plural form of an organisation name is indicative of org-for-product metonymies and is a strong feature for proper name metonymy resolution [1]). Finally, it is useful to identify the phrase boundary of a given NE in order to be able to cut off inflections. This is not a trivial task in free word order languages, where NEs of the same type can follow each other without a punctuation mark.

The problem is difficult to solve because, unlike common nouns, NEs cannot be listed. Consider, for instance, the following problematic cases of finding the lemma and the affix of a NE:

1. the NE ends in an apparent suffix,
2. two (or more) NEs of the same type follow each other and are not separated by punctuation marks,
3. one NE contains punctuation marks within.

In morphologically rich languages such as Hungarian, nouns (hence NEs as well) can have hundreds of different forms owing to grammatical number, possession marking and grammatical cases. When looking for the lemmas of NEs (case 1), the word form being investigated is deprived of all the suffixes it may bear. However, there are some NEs that end in an apparent suffix (such as McDonald's or Philips in English, or Pannon in Hungarian, with -on meaning "on"), where this pseudo-suffix belongs to the lemma of the NE and should not be removed. Such proper names make the lemmatisation task non-trivial.

In the second and third cases, when two NEs of the same type follow each other, they are usually separated by a punctuation mark (e.g. a comma). Thus, if present, the punctuation mark signals the boundary between the two NEs.
However, the assumption that punctuation marks are reliable markers of boundaries between consecutive NEs, and that the absence of punctuation marks indicates a single (longer) name phrase, often fails, so a more sophisticated solution is necessary to locate NE phrase boundaries. Counterexamples to this naive assumption are NEs such as Stratford-upon-Avon, where two hyphens occur within one single NE.

In order to select the appropriate lemma for each problematic NE, we applied the following strategy. Each ending that seems to be a possible suffix is cut off the NE. For NEs that consist of several tokens, whether separated by punctuation marks or not, every possible cut is performed. Then Google and Yahoo searches are carried out on all the possible lemmas. We trained several machine learning classifiers to find the decision boundary between appropriate and inappropriate lemmas based on the frequency results of the search engines.

2 Related Work

Lemmatisation of common nouns. Lemmatisation, that is, dividing the word form into its root and suffixes, is not a trivial issue in morphologically rich languages. In agglutinative languages such as Hungarian, a noun can have hundreds of different forms owing to grammatical number, possession marking and a score of grammatical cases: a Hungarian noun has 268 different possible forms [2]. On the other hand, there are lemmas that end in an apparent suffix (which is obviously part of the lemma), so it is sometimes unclear what belongs to the lemma and what functions as a suffix. However, the lemmatisation of common nouns can be made easier by relying on a good dictionary of lemmas [3].

Lemmatisation of NEs. The problem of proper name lemmatisation is more complicated, since NEs cannot be listed exhaustively, unlike common nouns, due to their diversity and increasing number. Moreover, NEs can consist of several tokens, in contrast to common nouns, and the whole phrase must be taken into account. Many suffixes can be added to them in Hungarian (e.g. Invitelben, where Invitel is the lemma and -ben means "in"), and they can bear the plural or genitive marker -s or 's in English (e.g. Toyotas). What is more, there are NEs that end in an apparent suffix (such as McDonald's), where this pseudo-suffix belongs to the lemma of the NE.

NE lemmatisation has not attracted much attention so far, because it is not as serious a problem in major languages like English and Spanish as it is in agglutinative languages. An expert rule-based method and several string distance-based methods for Polish person name inflection removal were introduced in [4], and a corpus-based rule induction method for every kind of unknown word in Slovene was studied in [5]. The scope of our study lies between these two, as we deal with different kinds of NEs. On the other hand, these studies focused on removing inflectional suffixes, while our approach handles the separation of consecutive NEs as well.

Web-based techniques. Our main hypothesis is that the lemma of an NE has a relatively high frequency on the World Wide Web, in contrast to the frequency of the affixed form of the NE. Although there are several papers that show how to use the WWW to solve simple natural language problems, e.g. [6,7], we think that this will become a rapidly growing area and that more research will be carried out over the next couple of years.
To the best of our knowledge, there is currently no procedure for NE lemmatisation which uses the Web as a basis for decision making.

3 Web-Based Lemmatisation

The use of online knowledge sources in Human Language Technology (HLT) and Data Mining tasks has become an active field of research over the past few years. This trend has been boosted by several special and interesting properties of the World Wide Web. First of all, it provides a practically limitless source of (unlabelled) data to exploit and, more importantly, it can bring some dynamism to applications. As online data change and rapidly expand over time, a system can remain up-to-date and extend its knowledge without the need for fine-tuning or any human intervention (such as retraining on up-to-date data). These features make the Web a very useful source of knowledge for HLT applications as well. On the other hand, the data can only be accessed via search engines (e.g. we cannot iterate through all of the occurrences of a word). There are two interesting problems here: first, appropriate queries must be sent to a search engine; second, the response of the engine offers several opportunities (result frequencies, snippets, etc.) in addition to simply "reading" the pages found. In the case of lemmatisation we make use of the result frequencies returned by the search engines.

In order to select the appropriate lemma for each problematic NE, we applied the following strategy. In step-by-step fashion, each ending that seems to be a possible suffix is cut off the NE. For apparently multiple NEs, whether separated by punctuation marks or not, every possible cut is performed; that is, in the case of Stratford-upon-Avon, the possibilities Stratford + upon + Avon (3 lemmas), Stratford-upon + Avon and Stratford + upon-Avon (2 lemmas), and Stratford-upon-Avon (1 lemma) are generated. After having found all the possible lemmas, we perform a Google search and a Yahoo search. Our key hypothesis is that the frequency of a lemma candidate on the WWW – or at least the ratio of the full-form and lemma-candidate frequencies – is relatively high for an appropriate lemma and low in incorrect cases.

In order to verify our hypothesis, we manually constructed four corpora of positive and negative examples for NE lemmatisation. Then Google and Yahoo searches were performed on every possible lemma, and training datasets were compiled for machine learning algorithms using the queried frequencies as features. The final decision rules were learnt on these datasets.

4 Experiments

In this section we describe how we constructed our datasets and then investigate the performance of several machine learning methods on the tasks outlined above.

4.1 The Corpora

The lists of negative and positive examples for lemmatisation were collected manually. We adopted the principle that we had to work on real-world examples (we did not generate fictitious examples), so our three annotators were asked to browse the Internet and collect "interesting" cases. The corpora are the unions of the lists collected by the three annotators (all linguists) and were checked by the chief annotator. The samples mainly consist of occurrences of person names, company names and geographical locations on web pages. Table 1 lists the size of each corpus (constructed for the three problems). The corpora are freely accessible.

The first problematic case of finding the lemma of a NE is when the NE ends in an apparent suffix (which we will call the suffix problem in the following).
In agglutinative languages such as Hungarian, NEs can have hundreds of different inflected forms. In English, nouns can only bear the plural or the possessive marker -s or 's. There are also NEs that end in an apparent suffix (such as Adidas in English), but this pseudo-suffix belongs to the lemma of the NE. We decided to build two corpora for the suffix problem, one for Hungarian and one for English, and we produced the possible suffix lists for the two languages. In Hungarian, more than one suffix can be matched to several phrases. In these cases we examined every possible cut; the correct lemma (chosen by a linguist expert) became a positive example, while every other cut was treated as a negative one.

The other two lemmatisation tasks – the case where several NEs of the same type follow each other and are not separated by punctuation marks, and the case where one NE contains punctuation marks within – are handled together in the following (and called the separation task). When two NEs of the same type follow each other, they are usually separated by a punctuation mark (e.g. a comma or a hyphen). Thus, if present, the punctuation mark signals the boundary between the two NEs (e.g. Arsenal-Manchester final; Obama-Clinton debate; Budapest-Bécs marathon). However, the assumption that punctuation marks are reliable markers of boundaries between consecutive NEs, and that the absence of punctuation marks indicates a single (longer) name phrase, often fails, so a more sophisticated procedure is necessary to locate NE phrase boundaries. Counterexamples to this naive assumption are NEs such as the Saxon-Coburg-Gotha family, where the hyphens occur within the NE, and sentences such as Gyurcsány Orbán gazdaságpolitikájáról mondott véleményt, which can be read as 'Gyurcsány expressed his views on Orbán's economic policy.' (two consecutive entities) or as 'He expressed his views on Orbán Gyurcsány's economic policy.' (one single two-token-long entity). Without background knowledge of the participants in the present-day political sphere in Hungary, the separation of the above two NEs would pose a problem. In fact, the first rendition of the Hungarian sentence conveys the true, intended meaning; that is, the two NEs are correctly separated. In the second version, the NEs are not separated and are treated as a single two-token-long entity. In Hungarian, however, a phrase like Gyurcsány Orbán could be a perfectly good full name, Gyurcsány being a family name and Orbán being – in this case – the first name.

Table 1. The sizes of the corpora

                     Suffix Eng   Suffix Hun   Separation Eng   Separation Hun
Positive examples    74           207          51               137
Negative examples    84           543          34               69

As consecutive NEs without punctuation marks appear frequently in Hungarian due to the free word order, we decided to construct a corpus of negative and positive cases for Hungarian. In English, such cases can occur only as the consequence of a spelling error. Hence, for English we focused on special punctuation marks which separate entities in some cases (Obama-Clinton debate) but are part of the entity in others. In the separation task there are several cases where more than one cut is possible (more than two tokens in Hungarian and more than one special mark in English). In such cases we again asked a linguist expert to choose the correct cut, and every incorrect cut became a negative example.
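The candidate generation described in Section 3 can be sketched as follows. This is only a simplified illustration: the suffix lists are deliberately tiny and the helper names are invented, not part of the actual implementation.

    from itertools import product

    # Illustrative, deliberately incomplete suffix lists; the real suffix
    # inventories for Hungarian and English are much larger.
    SUFFIXES = {"en": ["s", "'s"], "hu": ["on", "ben", "t", "nak"]}

    def suffix_candidates(token, lang):
        """Return the token itself plus every form obtained by cutting off
        an ending that looks like a suffix (the suffix problem)."""
        candidates = {token}
        for suffix in SUFFIXES[lang]:
            if token.endswith(suffix) and len(token) > len(suffix):
                candidates.add(token[: -len(suffix)])
        return sorted(candidates)

    def separation_candidates(tokens, sep="-"):
        """Enumerate every way of grouping consecutive tokens into entities
        (the separation task); for consecutive Hungarian names sep would be
        a space instead of a hyphen."""
        cuts = []
        # every boundary between two tokens is either kept inside one
        # entity or treated as a border between two entities
        for pattern in product([False, True], repeat=len(tokens) - 1):
            groups, current = [], [tokens[0]]
            for token, is_border in zip(tokens[1:], pattern):
                if is_border:
                    groups.append(sep.join(current))
                    current = [token]
                else:
                    current.append(token)
            groups.append(sep.join(current))
            cuts.append(groups)
        return cuts

    print(suffix_candidates("Philips", "en"))
    # ['Philip', 'Philips']
    print(separation_candidates(["Stratford", "upon", "Avon"]))
    # [['Stratford-upon-Avon'], ['Stratford-upon', 'Avon'],
    #  ['Stratford', 'upon-Avon'], ['Stratford', 'upon', 'Avon']]

Each resulting candidate is then sent as a quoted query to the search engines, as described in the next subsection.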
4.2 The Feature Set

To create training datasets for the machine learning methods – which try to learn how to separate correct and incorrect cuts based on labelled examples – we sent queries to the Google and Yahoo search engines using their APIs (Google API: http://code.google.com/apis/soapsearch/; Yahoo API: http://developer.yahoo.com/search/). The queries were enclosed in quotation marks, and the site:.hu constraint was used in the Hungarian part of the experiments.

In the suffix tasks, we sent queries with and without suffixes to both engines and collected the number of hits. The original database thus contained four-dimensional feature vectors: two dimensions held the numbers of Google hits and the other two held the corresponding values from the Yahoo search engine. The original training datasets for the separation tasks contained six features: two stood for the numbers of Google hits for the potential first and second parts of a cut, one represented the number of Google hits for the whole observed phrase, and the remaining three features conveyed the same information obtained from the Yahoo search engine. Each of the four tasks was a binary classification problem; the class value was set to 0 for negative examples and 1 for positive ones.

Our preliminary experiments showed that using just the original form of the datasets (Non-Rate rows of Table 2) is not optimal in terms of classification accuracy. Hence we performed some basic transformations on the original data: the first component of the feature vector was divided by the second component, and if the second component was zero, the new feature value was also set to zero (Rate rows in Table 2). This yielded a two-dimensional dataset for the suffix classification tasks when we utilised both Yahoo and Google hits, and a four-dimensional dataset for the separation tasks. Finally, we took the minimum and the maximum of the separated parts' ratios, which makes it possible to learn rules such as "if one of the separated parts' frequency ratio is higher than X".
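This Rate transformation can be sketched as follows; the sketch assumes that the part/whole ratios are computed separately per search engine, and the hit counts in the usage example are invented.

    def rate(numerator, denominator):
        """Divide one hit count by another; if the denominator is zero,
        the feature value is defined to be zero, as described above."""
        return 0.0 if denominator == 0 else numerator / denominator

    def suffix_rate_features(google_hits, yahoo_hits):
        """Rate features for the suffix task.

        Each argument is a pair of hit counts (lemma candidate, original
        form); the choice of numerator here is an assumption."""
        return [rate(*google_hits), rate(*yahoo_hits)]

    def separation_rate_features(google_hits, yahoo_hits):
        """Rate features for the separation task.

        Each argument is a triple (hits_part1, hits_part2, hits_whole_phrase).
        The two part/whole ratios per engine are reduced to their minimum and
        maximum, so a learner can express rules like "one of the separated
        parts' frequency ratio is higher than X"."""
        features = []
        for part1, part2, whole in (google_hits, yahoo_hits):
            ratios = [rate(part1, whole), rate(part2, whole)]
            features.extend([min(ratios), max(ratios)])
        return features

    # Invented hit counts for a candidate cut of "Obama-Clinton":
    print(separation_rate_features((3_000_000, 2_500_000, 40_000),
                                   (2_100_000, 1_700_000, 35_000)))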
4.3 Machine Learning Methods

In our tests we experimented with several machine learning algorithms. Each method was evaluated using 10-fold cross-validation, and classification accuracy was used as the evaluation metric in each experiment. We employed the WEKA [8] machine learning library to train our models, and different kinds of learning algorithms and parameter settings were examined to achieve the highest possible accuracy.

The first method is a baseline, the naive classifier, which classifies each sample with the most frequent label observed on the training dataset. We also used the k-Nearest Neighbour [9] algorithm, one of the simplest machine learning algorithms: an object is classified by the majority vote of its neighbours, i.e. it is assigned to the class most common amongst its k nearest neighbours, where k is a positive integer that is typically not very large. We further used the C4.5 [10] decision tree learning algorithm, a widely applied method in data mining. In a decision tree, each interior node corresponds to an attribute and an edge to a child represents a possible value or interval of values of that attribute; a leaf represents the most probable class label given the values of the attributes on the path from the root. A tree is learned by splitting the source set into subsets based on an attribute value test, and this process is repeated on each derived subset in a recursive manner. The recursion stops either when further splitting is not possible or when the same classification can be applied to each element of the derived subset. The M parameter of C4.5 defines the minimum number of instances per leaf, i.e. a node is not split further if the number of instances it covers is fewer than M. Finally, Logistic Regression [11] is another well-studied machine learning method. It seeks to maximise the conditional probability of the classes, subject to feature constraints (observations); this is done by weighting the features so as to maximise the likelihood of the data.

The results obtained by using the different feature sets and the machine learning algorithms described above are presented in Table 2 for the suffix task and in Table 3 for the separation task.

4.4 Discussion

For the English suffix task, the size of the datasets is quite small, which leads to a dominance of the k-Nearest Neighbour classifier, since lazy learners – like k-Nearest Neighbour – usually achieve good results on small datasets. In this task the training algorithms attain their best performance on the transformed datasets using the rates of query hits (this holds for both the Yahoo and the Google searches). One could say that the rate of the hits (one feature) is the best characterisation of this task. In contrast, for the Hungarian suffix problem the original dataset characterises the problem better, and the transformation is unnecessary. The best results for the Hungarian suffix problem are achieved on the full dataset, but they are almost the same as those for the untransformed Yahoo dataset. Without doubt, this is due to a special property of the Yahoo search engine, which searches accent-sensitively, in contrast to Google. For example, for the query Ottó, Google finds every webpage containing either Ottó or Otto, while Yahoo returns only the pages containing Ottó.

Table 2. Suffix task results obtained from applying different learning methods. (The first letter stands for the language /E – English; H – Hungarian/. The second letter stands for the search engine used /B – Both; G – Google; Y – Yahoo/. NR means Non-Rate database, R means Rate database. Thus, for example, H-B-NR means the Hungarian problem using the non-rate dataset and both search engines.)

          kNN k=3   kNN k=5   C4.5 M=2   C4.5 M=5   Log Reg   Base
E-B-NR    89.24     89.24     86.71      84.18      84.81     53.16
E-B-R     93.04     89.87     91.77      92.41      73.42     53.16
E-G-NR    87.34     86.71     87.34      81.65      82.28     53.16
E-G-R     93.67     93.67     87.97      92.41      90.51     53.16
E-Y-NR    89.87     89.87     86.08      85.44      84.18     53.16
E-Y-R     91.77     91.77     87.34      91.77      88.61     53.16
H-B-NR    94.27     94.27     82.67      90.00      88.27     72.40
H-B-R     84.67     86.13     81.73      81.73      72.40     72.40
H-G-NR    85.33     85.33     82.40      82.93      83.33     72.40
H-G-R     83.60     78.40     83.60      77.60      77.60     72.40
H-Y-NR    93.73     93.73     83.87      88.13      86.13     72.40
H-Y-R     87.20     84.27     87.20      74.00      74.00     72.40

Table 3. Separation task results obtained from applying different learning methods

            kNN k=3   kNN k=5   C4.5 M=2   C4.5 M=5   Log Reg   Base
English     88.23     84.71     95.29      94.12      77.65     60.00
Hungarian   79.25     81.13     80.66      79.72      70.31     64.63

The separation problem for Hungarian proved to be a difficult task. The decision tree (which we found to be the best solution) is a one-level tree with a single split. This can be interpreted as: if one of the resulting parts' frequency ratio is high enough, then it is an appropriate cut.
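Written out, such a one-level rule amounts to a single threshold test; in the sketch below the threshold value is invented for illustration, not the split actually learned by C4.5.

    # A one-level decision rule of the kind learned for the Hungarian
    # separation task: a cut is accepted if the frequency ratio of (at
    # least) one of the resulting parts is high enough.  THRESHOLD is an
    # invented value, not the one actually learned.
    THRESHOLD = 50.0

    def is_appropriate_cut(part_ratios):
        """part_ratios: part-frequency / whole-phrase-frequency ratios of
        the parts produced by a candidate cut."""
        return max(part_ratios) > THRESHOLD

    print(is_appropriate_cut([120.0, 0.4]))   # True  -> treat as two entities
    print(is_appropriate_cut([1.2, 0.8]))     # False -> keep as one entity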
It is interesting to see that among the learned rules for the English separation task there is a constraint on the second part of a possible separation, while the hypotheses learnt for Hungarian consisted of simple "if (any) one of the resulting parts is ..." rules.

5 Conclusions

In this paper we introduced corpora for the English and Hungarian Named Entity lemmatisation tasks. The corpora are freely available for further comparative studies. The NE lemmatisation task is very important for textual data indexing systems, for instance, and is of great importance for agglutinative (Finno-Ugric) languages and other languages that are rich in inflections (like the Slavic languages). To handle this task, we proposed a web-based heuristic which sends queries for every possible lemma to two popular web search engines. Based on the constructed corpora, we automatically derived simple decision rules over the search engine responses by applying machine learning methods.

Despite the small size of our corpora, we obtained fairly good empirical results, and even better accuracies can probably be achieved with bigger corpora. The 91.09% average accuracy score supports our hypothesis that the frequencies of the possible lemmas alone provide enough information to help find the right separations. We encountered several problems when using search engines to obtain the lemma frequencies: the need for an accent-sensitive search, difficulties in the proper handling of punctuation marks in queries, and the fact that the reported hit counts are only estimates. We think that if offline corpora of an appropriate size were available, the frequency counts would be more precise and our heuristic could probably attain even better results.

Acknowledgments. This work was supported in part by the NKTH grant of the Jedlik Ányos R&D Programme 2007 of the Hungarian government (codename TUDORKA7). The authors wish to thank the anonymous reviewers for their valuable comments.

References

1. Farkas, R., Simon, E., Szarvas, Gy., Varga, D.: GYDER: Maxent metonymy resolution. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, pp. 161–164. Association for Computational Linguistics (2007)
2. Mel'čuk, I.: Modèle de la déclinaison hongroise. In: Cours de morphologie générale (théorique et descriptive), vol. 5, pp. 191–261. Les Presses de l'Université de Montréal / CNRS Éditions, Montréal (2000)
3. Halácsy, P., Trón, V.: Benefits of resource-based stemming in Hungarian information retrieval. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 99–106. Springer, Heidelberg (2007)
4. Piskorski, J., Sydow, M., Kupść, A.: Lemmatization of Polish person names. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing, Prague, Czech Republic, pp. 27–34. Association for Computational Linguistics (2007)
5. Erjavec, T., Džeroski, S.: Machine learning of morphosyntactic structure: Lemmatizing unknown Slovene words. Applied Artificial Intelligence 18, 17–41 (2004)
6. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL (2006)
7. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence 165, 91–134 (2005)
8. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, San Francisco (2005)
9. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6, 37–66 (1991)
10. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
11. Berger, A.L., Della Pietra, S.A., Della Pietra, V.J.: A maximum entropy approach to natural language processing. Computational Linguistics 22, 39–71 (1996)