Academia.eduAcademia.edu

LXService: Web Services of Language Technology for Portuguese

2008, … of LREC2008. ELRA, …

In the present paper we report on the development of a cluster of web services of language technology for Portuguese that we named as LXService. These web services permit the direct interaction of client applications with language processing tools via the Internet. This way of making ...

LXService: Web Services of Language Technology for Portuguese António Branco, Francisco Costa, Pedro Martins, Filipe Nunes, João Silva, Sara Silveira University of Lisbon Dep. Informática, Faculdade de Ciências de Lisboa, Campo Grande, 1749-016 Lisboa, Portugal {Antonio.Branco, fcosta, pmartins, fnunes, jsilva, sara.silveira}@di.fc.ul.pt http://nlx.di.fc.ul.pt Abstract In the present paper we report on the development of a cluster of web services of language technology for Portuguese that we named as LXService. These web services permit the direct interaction of client applications with language processing tools via the Internet. This way of making available language technology was motivated by the need of its integration in an eLearning environment. In particular, it was motivated by the development of new multilingual functionalities that were aimed at extending a Learning Management System and that needed to resort to the outcome of some of those tools in a distributed and remote fashion. This specific usage situation happens however to be representative of a typical and recurrent set up in the utilization of language processing tools in different settings and projects. Therefore, the approach reported here offers not only a solution for this specific problem, which immediately motivated it, but contributes also some first steps for what we see as an important paradigm shift in terms of the way language technology can be distributed and find a better way to unleash its full potential and impact. 1. Introduction In this paper we present the achievements obtained in the development of a cluster of web services of language technology for Portuguese. The development of these web services started in the scope of the LT4eL-Language Technology for e-Learning project. They are supported by a range of language technology tools that have been developed in the past at the University of Lisbon in the scope of a number of previous projects. The LT4eL project aims at using multilingual language technology tools and semantic web techniques for supporting e-Learning activities (http LT4eL). The developed technology will facilitate access to learning objects and will support decentralization and cooperation in content management. To a large extent, this is achieved by extending Learning Management Systems with new, language technology based functionalities, namely a keyword extractor a glossary extractor and a semantic search tool. At some point in their internal workflow, these new functionalities resort to the outcome of shallow language processing tools ranging from sentence chunkers to lemmatizers and including POS taggers and morphological analyzers. The design constraints impinging on this access set a clear case for the deployment of web services that should be able to deliver the expected outcome of those tools: ƒ new applications or functionalities build on the outcome of other subordinated applications or tools; ƒ new versions of the subordinated tools with improved performance should be seamlessly made available with no extra development effort on the side of the client application; ƒ the licensing of copies of the subordinate tools is not an option available for the stakeholder of the new client application, but the licensing of the outcome of those tools, as delivered by a remote internet-based service, can be. Though immediately motivated by a specific need to made available language technology for Portuguese for a particular purpose, the work reported in the present paper helps to shed light on what we deem to be an important paradigmatic shift in the way language technology is made available, and most likely in terms of the overall impact of this technology with respect to its neighbouring areas. This is the issue addressed in the next Section 2. In Section 3, we introduce the processing tools for Portuguese that are being resorted to in order to support the deployment of the web services. The web services and their development are described in Section 4. In Section 5, we discuss the work in progress that is currently being undertaken, and finally in Section 6 we present concluding remarks. 2. Language technology via web services From the initial design plan up to the final users, the production chain of language resources encompasses several stages, where development tools, validation methodologies, or encoding standards, among many other 2577 instruments, are called to play a role. Aiming at improved production chains, – either in terms of efficiency or in terms of the quality of their outcome –, almost all such stages and supporting instruments have been under extensive development in the recent past. This is attested, for instance, by the burgeoning activity being reported in the LREC conferences, to refer just one of several signs of evidence. One of the key stages in the language resources production chain is the distribution step. This step involves non negligible issues, ranging from intellectual property to versioning aspects just to refer a few, that in their more structured setups, have found important support by distribution agencies such as the European Language Resources Association (http ELRA) or the Linguistic Data Consortium (http LDC). Also this stage of distribution is receiving closer attention and being developed so that it may get raised to a new level of organization that improves its added value, by exploiting the full potential of web technologies. Fostered by projects like DAM-LR (http DAM-LR) or initiatives like CLARIN (http CLARIN) and Language Grid (http Language Grid), what is being pursued is the distribution of language resources via web services that grant seamless access to human users and, crucially, also to client applications — by automatically handling their authentication and authorization privileges, tuning to their profile and preferences, managing licensing issues, selecting convenient personalized versions, etc. While this reshuffling of the distribution of language resources will be taking shape in the near future, the direction of its evolution is inspiring enough to induce similar reshuffling trends in neighbouring areas, in particular in what concerns language technology. In this respect, the granting of access to the language technology tools as web services may be the key to bringing the distribution stage of these tools to a level of dissemination similar to the one achieved for language resources. In this connection, it is reasonable to anticipate that the availability of language technology tools as web services will have an even deeper impact than a similar move will have on the side of language resources. In tandem with the web service-oriented distribution of resources, this can be the stepping stone for a whole new stage of more widespread incorporation of natural language technology in the semantic web itself. Note that the key issue here is not on taking the language technology as a component of semantic web services, but on the language processing tools being supplied under the form of web services themselves to client applications (which of course will enhance their availability as components for semantic web services in other domains). In particular, and narrowing our focus of attention to our field, this new form of availability of language technology tools will have a major positive feedback in the production chain of language resources itself. It will enhance more rapid and accurate production of language resources, very likely by several orders of magnitude. 3. Language technology tools Against the background discussed in the previous Section, and having as immediate motivation the integration of our language technology tools in the LT4eL project, we worked towards making a range of tools for Portuguese available under this new approach. These tools are a subset of the tools that we have been developing for the processing of Portuguese and they were selected because they satisfy a number of features that are likely to make them more suitable for initial experimentation: They are fast, robust, the linguistic information in their output is well understood, and they perform at state of the art accuracy. They include the following individual tools, covering analysis and generation procedures: Sentence chunker: detects and marks paragraph and sentence boundaries; 99.94% accuracy Tokenizer: segments text into tokens, expands contractions, detaches clitic pronouns from verbs, etc.; 99.72% accuracy POS tagger: assigns POS tags to tokens in context; 96.87% accuracy Nominal featurizer: assigns inflection features (gender and number) to words from the nominal POS categories, resolving ambiguity in context; 91.07% f-score. Nominal lemmatizer: assigns a lemma to words from the nominal POS categories (viz. common nouns and adjectives), resolving ambiguity in context; 97.67% Verbal featurizer and lemmatizer: assigns inflection features (tense, person and number) and lemma (infinitive form) to verbs, resolving ambiguity in context; 95.96% f-score. Verbal conjugator: delivers a conjugation table given any (attested or putative) infinitive verb form and the specification of possible associated clitics Nominal inflector: delivers an inflected form given another (attested or putative) form and the inflected features required for the output For a detailed description, analysis and evaluation results of these tools, see (Branco and Silva, 2004, 2006; Silva, 2007; Nunes, 2007). 4. Web services A web service is a software application that, crucially, supports direct interaction with other software applications over the Internet (Alonso et al., 2005, Ch.5). In its current stage of development, the LXService for language technology of Portuguese is a web service that already 2578 supports the direct interaction of their clients with three of the above listed tools via the Internet, namely: the sentence chunker, the tokenizer and the POS tagger. Basically, given an input text, the clients can interact with these tools via the web service in order to obtain the outcome produced by them, that is a new version of the input text after its being linguistically annotated up to the level that these tools can ensure. The access to each one of the tools is granted via different methods — String chunks(Sring text), String tokenizes(String text)and String posTags(String text) — that are members of the Java class LXClient. These methods are invoked over an object of this class. Its constructor requires one parameter related to the authentication of the client before the LXService, namely the client’s username as this is registered at the LXService database of clients. This class is part of the package pt.ul.fc.di.nlx.lxServiceClient, which includes also the classes for Exceptions and for the class responsible for the authentication before the LXService. Hence, for making an application that is a client of LXService to work, one just needs to get hold of a copy of this package, after the authorization for the utilization of the service being granted and a username and password had been assigned. A summary of the API for the key client class of the LXService, LXClient, is displayed in Annex A. At the service endpoint, in turn, the web services offered by the LXService are implemented in a Java class whose methods are remotely invoked by clients and that locally call the appropriate chain of processing tools on the server side. These services run on a web server supported by Apache Tomcat. To implement such services, the SOAP (Simple Access Object Protocol) was used to transfer the data between them and their clients, in particular with the implementation provided by Apache Axis. Local copies of the Axis libraries at the client endpoint are thus also required, together with the pt.ul.fc.di.nlx.lxServiceClient package. The authentication is ensured by the implementation of Web Security Service, provided by Apache WSS4J, which is endorsed by the OASIS security standard. The key functionality of a web service — that it supports direct interaction among applications over the Internet, – however, is not in conflict with the fact that further, online services may be offered to human users as well. Accordingly, in order to enhance their visibility and extend their added value, in previous projects we have been progressively rendering them also under a presentation layer that permits their usage directly by human users, e.g. to support second language learning, basic linguistic research, etc. Hence, the tools presented above were bundled into four suites of self-contained functionality as described below, whose front pages are depicted in Annex B: LX-Inflector, http://lxinflector.di.fc.ul.pt This service allows its clients to obtain the inflected form from any nominal form provided (common nouns or adjectives — neologisms included), and according to a given specification of inflection feature values entered. This functionality is supported by the composition of the nominal featurizer, lemmatizer and inflector tools indicated above. These tools are implemented in C and Java, and combined through a shell script run by JSP, which interfaces this service with its clients. LX-Lemmatizer, http://lxlemmatizer.di.fc.ul.pt This service provides fully-fledged lemmatization of Portuguese verbs: given a verb form (neologisms included), it delivers all possible lemmata together with their respective inflection features. The supporting verbal lemmatization tool was implemented in Java. It is invoked by the clients via JSP. LX-Conjugator, http://lxconjugator.di.fc.ul.pt This service provides fully-fledged conjugation tables for any verb form entered (neologism included), including the full range of pronominal forms. The underlying conjugator tool is implemented in Python, which is invoked via CGI. LX-Suite, http://lxsuite.di.fc.ul.pt This service is supported by the composition of the range of processing tools for analysis: sentence chunker, tokenizer, POS tagger, nominal featurizer, nominal lemmatizer and verbal lemmatizer. The resulting functionality ensures that the input, entered as raw text, is sentence and token-segmented, their tokens are associated with corresponding lemmata and tagged with linguistic information on their POS and inflection feature values. The outcome is resolved with respect to ambiguity that arise at the different levels of processing. Most of these tools are implemented in C. They are pipelined together through a shell script that is invoked by clients via CGI. 5. Work in progress The web services just described were put together by resorting to mature web technologies. While offering the expected robustness to the services supported, these technologies fall short of addressing some of the key issues that emerging technologies for web services are aimed at addressing. As of now: the location of these services, as well as their functionality were made known via postings to email lists of communities whose users have interests that are close to language technology — which imposes that client applications still need to be tuned “by hand” if they are to resort to them. Very interestingly, and encouraging, even under this first setup, and without an API yet that can be resorted to 2579 for the principled invocation of some of these tools, their experimental version as online versions for human has not hampered that these versions have already been accessed by client applications. Apparently, their programmers have found that the added value offered by these services are worth enough to justify going into the work of getting at their outcome via bare scrapping of the HTML document it is embedded into. At the moment of printing the present paper, the LXService is planned to be expanded so that further methods are included in the client LXClient.class which grant access to the remaining processing tools listed above, namely those concerned with lemmatization and morphological analysis. This is a step to be taken with the benefit of the experience and feedback gathered with the running of the current version of the implemented services. LXService is also being worked out in order to evolve towards further stages that integrate technologies and standards that are emerging specifically to support web services. This includes the utilization of WSDL language to describe these services and allow for their automatic finding (in tandem with future mature UDDI registers). 6. Concluding remarks Given the experience gained so far, and even only with the current preliminary version, a couple of key issues concerning the viability of the sought for paradigm of language technology tools as web services have emerged. And they are very likely to persist even through the upcoming versions that integrate the above mentioned specific technologies for web services. As can be checked upon using the LX-Suite online service, our tools are delivering their core outcome in an internally defined format. It will be straightforward to convert from this format into any other format established as a standard (expectedly on top of XML). In spite of its utmost convenience, and of previous and current efforts, no encoding schema for linguistic information associated to natural language expressions has gained momentum as a de facto standard. Another issue concerns authentication and access by trusted clients and copy rights management. As the language technology tools as web services paradigm expands and gains momentum (or even to make this happen), it is likely that providers of high quality services will be willing to join in case they can’t hold control over these aspects. Broadly taken, these types of questions are not restricted to the class of web services we are establishing. They appear to be the incarnation in this realm of language technology of similar problems that the deployment of service-oriented computation is facing (Alonso et al., 2005, Ch.9). Among the different routes that are envisaged as helpful to overcome these problems, there is one that we think may well be suited for our realm of language resources and technology. One of the possible ways out is that a major player, given its dimension and importance, induces the establishment of de facto encoding standards (the schemas that that entity eventually adopts). Moreover, if that entity is reliable and independent from service providers and consumers, it will be accepted as a trusted broker for the contracting of services among interested parties and for the certifying of copy rights compliant transactions. The research infra-structure whose creation is a major goal of CLARIN is a most prominent candidate to evolve towards an entity that can play such a role. 7. References Branco, António and João Silva, 2004, "Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese". Proceedings LREC2004. Branco, António and João Silva, 2006, "Dedicated Nominal Featurization of Portuguese", Lecture Notes in Artificial Intelligence, 3960, Springer. http CLARIN - http://www.mpi.nl/clarin/ http DAM-LR - http://www.mpi.nl/DAM-LR http ELRA - http://www.elra.info http Language Grid - http://langrid.nict.go.jp http LT4eL - http://www.let.uu.nl http LDC - http://www.ldc.upenn.edu Silva, João, 2007, Shallow Processing of Portuguese: From Sentence Chunking to Nominal Lemmatization, MSc Dissertation, University of Lisbon. Nunes, Filipe, 2007, Verbal Lemmatization and Featurization of Portuguese with Ambiguity Resolution in Context, MSc Dissertation, University of Lisbon. 2580 Annex A - API (summary) of LXClient 2581 Annex B - GUIs of the online LX tools 2582 2583