Web Services Architecture for Language Resources
Angelo Dalli*, Valentin Tablan*, Kalina Bontcheva*, Yorick Wilks*, Dan Broeder**, Hennie
Brugman**, Peter Wittenburg**
* NLP Research Group, Department of Computer Science
University of Sheffield
{a.dalli, v.tablan, k.bontcheva, y.wilks}@dcs.shef.ac.uk
** Max Planck Institute for Psycholinguistics, Nijmegen
{dbroeder, hbrugman, pwittenburg}@mpi.nl
Abstract
A web services based architecture for Language Resources utilizing existing technology such as XML, SOAP, WSDL and UDDI is
presented. The web services architecture creates a pervasive information infrastructure that enables straightforward access to two kinds
of Language Resources: traditional information sources and language processing resources. Details about two practical
implementations of this web services architecture are given.
The concept of web services as being lightweight
components that offer an elegant means of integrating
different information repositories and services across the
Internet has always been a main objective in developing a
standard, interoperable system of web services. Industrial
and academic support for web services is increasingly
gaining strength and the future looks promising for their
widespread adoption (Narsu and Murphy, 2002; Conner,
2001; Gates, 2003). The idea of using web services for
Computational Linguistics is also gaining acceptance with
the increasing availability of various useful services
permitting researchers unprecented access to huge
amounts of information and advanced search services like
Google (Google, 2002). Linguistic resources are prime
candidates for web services applications to enable
increased collaboration between research groups and
avoid reduplication of resources and effort. Fortunately,
current web services technology can be used to provide
effective solutions to common problems faced by
researchers (Dalli, 2001; Dalli, 2002).
We propose a web services architecture for Language
Resources that uses a combination of Extensible Markup
Language (XML), Simple Object Access Protocol
(SOAP), Web Services Description Language (WSDL)
and Universal Discovery Description Integration (UDDI)
to achieve maximum benefit from these technologies in a
Computational Linguistics context (Box et al., 2000;
Christensen, et al., 2001; UDDI, 2001). The web services
architecture creates a pervasive information infrastructure
that enables straightforward access to two kinds of
Language Resources: traditional resources such as
lexicons, corpora, semantic networks, etc. and language
processing resources. The use of standard technology
ensures that there is wide support for developers working
with minimal knowledge of web services, and also
guarantees compatibility with legacy applications, while
keeping compatibility
with
major
development
frameworks such as Sun’s Java, IBM’s WebSphere, and
Microsoft’s .NET.
heterogeneous collection of different proprietary formats
and databases with minimal means, if any, of
interoperability with other Language Resources making it
hard to extend their usefulness beyond the life of their
originating projects (Cunningham, 1999). This is an even
more serious issue for smaller projects and Language
Resources for minority languages, since fewer people will
be willing to utilize non-major Language Resources if
there is no commonly accessible metadata description that
enables established tools to be used in an interoperable
manner.
The web services architecture achieves the goals of a
pervasive information infrastructure by using WSDL as its
Language Resource metadata description language, UDDI
as its main publication and discovery mechanism, and
SOAP as the means to retrieve information and execute
remote processes.
One main feature of these technologies is their reliance
on the availability of XML marked up data. Fortunately, a
significant amount of Language Resources are already in a
compatible format such as linguistic data marked up in the
Resource Description Framework (RDF) (Lassila and
Swick, 1999; Klyne et al., 2003), Open Lexicon
Interchange Format (OLIF) (McCormick, 2002), XCES EAGLES-ISLE format (Zampolli, 2000; EAGLES, 2000;
Bertagna et al., 2000) and Encoded Archival Description
(EAD) (NDMSO, 2002). A conversion layer using the
Extensible Stylesheet Language (XSL) or some other
appropriate technology can easily convert this kind of
information into the required XML format.
The web services architecture will need a common
taxonomy, called the Interoperable Extensible Language
Resource (IELR) standard, that caters for the most
common subset of linguistic information used to markup
and classify language resources, together with a similar
component for language processing resources. Due to the
lack of adequate taxonomies for language processing
resources IELR will adapt work done in related projects,
namely the General Architecture for Text Engineering
(GATE) and the Open Archives Initiative (OAI) (OAI,
2001; Wilks et al., 1998).
Existing Technologies
Legacy Applications and Interoperability
Most Language Resources that are currently available for
research and development can be currently classified as a
Legacy applications will need to have a custom made data
conversion layer to ensure that the data can be converted
Web Services
365
into some IELR compatible format. The amount of effort
required for this conversion layer depends on the degree
of structure present in the original format. An extensions
interface in the web services architecture allows legacy
applications, non-standard extensions such as uncommon
language phenomena and entirely new classes of language
processing techniques to be accommodated transparently.
The conversion layer implementation for legacy Language
Resources will thus need to provide necessary
transformations that convert proprietary formats to the
standard core format expected by the SOAP server, while
doing this transformation in reverse to facilitate updates of
the Language Resource by other linguistic applications
and processes.
Interoperability and inter-process communication are
achieved through SOAP. SOAP is used to encapsulate all
IELR resources, acting as a means of accessing relevant
information. SOAP provides XML-based interactions
between different Language Resources and related
applications over the HTTP protocol. Additionally, SOAP
also permits applications to run appropriate processes on
remote computers, using remotely stored data. Remote
process execution on Linguistic Resources is an area that
is still largely undeveloped in the Computational
Linguistics community. Initiatives such as the
amalgamation of grid based computing and web services
will hopefully bring substantial benefits, enabling more
sophisticated large-scale text processing to be performed.
The web services architecture defines a set of criteria
for applications to expose their underlying processing
algorithms – basically services have to support local and
remote data access, need to support pausing and resuming
of their processes, and need to be capable of splitting up
large data processing requests into small manageable
steps. This flexible approach ensures that applications that
conform to the web services architecture specification can
implement their own methods for scheduling and resource
use optimization.
Figure 1 shows how the SOAP layer is used in
conjunction with the conversion layer to provide an
effective encapsulation of the Language Resource,
enabling a common format for all Language Resources to
be expected by applications designed to run on the web
services architecture.
+
,QWHUQHW
, QW
UDQHW
$SSOLFDWLRQV
Additionally, initiatives such as the Open Archives
Initiative (OAI) and GATE already solve many of the
problems that arise in ensuring interoperability and
metadata descriptions of services and content, making
them both suitable for the implementation of diverse
Language Resources. The main drawback to these two
solutions is their reliance on proprietary data formats and
protocols, making it difficult for other third-party
applications to readily interoperate with these
architectures.
Performance considerations were taken into account in
the design of the architecture with communication
overhead and network congestion identified as being the
two most serious bottlenecks. Although XML is the
natural choice for storing and representing linguistic data
due to its simplicity and compatibility with a variety of
existing systems, its main drawback is that pure XML
databases are usually limited in their performance due to
the significant amount of processing needed to encode and
decode huge datasets in what is essentially a pure text
format. A better performance solution in this case was
found to be to utilise a two-pronged strategy where more
efficient data transfer protocols are used in preference to
HTTP for content delivery, and using traditional RDBMS
technology instead of pure XML datasets to speed up
processing. Relation database records can be used to store
linguistic data efficiently with a simple transformation
method converting the relational data to XML format.
LR Metadata Descriptions and LR Discovery
WSDL provides a standard means of creating accessible
XML based metadata descriptions of the Language
Resource being abstractly represented by the SOAP
server. WSDL is used to add an abstract layer describing
the services and features provided by the Language
Resource in a standard manner, significantly reducing the
development time for new applications and related
information
extraction
and
analysis
programs.
Additionally, client applications using WSDL are shielded
from the server implementation, greatly simplifying
maintenance and upgrade of existing facilities.
Figure 2 shows how WSDL acts as a metadata
description layer for the SOAP-encapsulated Language
Resource. WSDL provides a comprehensive means of
describing the mechanisms that should be used to access
and process content pertaining to a specific Language
Resource. A set of abstract operations – that can either
return unprocessed or processed information – are bound
to some network protocol and finally assigned to some
physical address to create a WSDL port. A series of
WSDL ports are then packaged together to form a web
service.
/DQJXDJH5 HVRXUFH
Figure 1 SOAP used for LR interoperability
/DQJXDJH
5HVRXUFH
' D WDEDVHV
Currently, few NLP applications and frameworks
support distributed processing and the notion of
processing resources. The popular GATE architecture
actually has some support for remote process execution
and algorithm abstraction, but this remains an underresearched area in Computational Linguistics.
/DQJXDJH3 U
RFHVVLQJ
5HVRXUFHV
Figure 2 WSDL used for LR extensibility
366
UDDI provides the third key component in the web
services architecture. UDDI provides a global registry of
web services that facilitates the development of an
International Language Resource Directory that aids in the
dissemination of metadata descriptions across different
research projects. UDDI makes it possible for research
projects around the world to find relevant Language
Resources easily according to a particular language or
linguistic phenomenon, and also according to the type of
processing resource needed.
Figure 3 UDDI used for universality and automated LR
discovery and integration
A common taxonomy for Language Resources,
especially for Language Processing Resources, is still not
readily available. Prototype taxonomies were developed
for our practical development experiments, but significant
work is envisaged to get the computational linguistics
community to agree on a standard UDDI taxonomy that
enables Language Resources created by various projects
to be classified and matched up accordingly using the
automated search functions already provided by the UDDI
servers.
Prototype Applications
Prototype applications of the web services architecture
have been made for two applications at the Max Planck
Institute for Psycholinguistics and the University of
Sheffield, in a data-oriented and processing-oriented
context respectively. These prototype applications enabled
the theoretical framework of the web services architecture
to be implemented in practice, gaining additional insights
into the process, while verifying the ease with which web
services can be added to existing applications.
Controlled Vocabulary Service
The controlled vocabulary service was implemented at the
Max Planck Institute for Psycholinguistics using a team of
two developers over the course of two days, with an
additional day for testing and integration. The MPI
application involved adding a web service on top of a
controlled vocabulary application, related to the IMDI
project, which enables researchers to find and exchange
relevant controlled vocabularies using web services
(Wittenburg et al., 2000). The main problem was in
enabling new users of the controlled vocabulary
application to automatically publish and share their
controlled vocabularies with other users. Since the data
was already stored in a standard XML-based format, the
effort focused on utilizing the publication and search
features provided by the web services architecture to
create an effective solution to this problem.
The service was implemented using the Java Web
Services Development Toolkit (JWSDP). JWSDP
provides the basic packages needed to add web services to
any Java application. Four additional sub-packages were
needed, namely the Java XML processing package
(JAXP), XML-based RPC (JAX-RPC), Java SOAP-based
messaging methods (JAXM) and the Java UDDI registry
access methods (JAXR). The main process involved the
creation of a WSDL stub on both the server side and the
client side. The service definition was initially defined and
then implemented to create the WSDL stubs using the
automated mapping tool provided in JWSDP. The web
application components, together with helper classes and
static resources needed to support the implementation
were created to obtain the final Web Application Archive
(WAR) file needed to deploy the web service.
The main difficulty encountered in the MPI
implementation was the fact that minimal support existed
for Java web services at the time when this prototype was
created in 2002. The situation has somewhat improved
with the availability of more sophisticated tools that
reduce the amount of steps and integration needed, as seen
in our second prototype development. Some problems
were also encountered in the integration of our prototype
taxonomy with UDDI, although this was expected since it
was the first time UDDI was used for this purpose.
GATE ANNIE Service
At the University of Sheffield, the web services
architecture was implemented on top of GATE’s ANNIE
component, which aims to eliminate the need for users to
keep re-implementing frequently needed algorithms and
provide a good starting point for new applications
(Cunningham et al., 2002; Maynard, 2002). The service
was implemented using the Apache Axis Toolkit, which
provides a web services extension to the popular Apache
server, enabling existing web based applications to
utilised web services with minimal redevelopment effort.
In order to lower the application development
overheads, GATE provides a number of useful and easily
customizable components, grouped together to form the
ANNIE (A Nearly-New In formation Extraction)
component1 . These components eliminate the need for
users to keep re-implementing frequently needed
algorithms and provide a good starting point for new
applications. The majority of these components use
GATE’s finite state techniques to implement various tasks
from tokenisation to semantic tagging and coreference,
with an emphasis on efficiency, robustness, and low1
A demonstration of how these components can be used to
highlight information in Web pages is available at
http://gate.ac.uk/annie/index.jsp
367
overhead portability, rather than full parsing and deep
semantic analysis. GATE also provides an extendable set
of document format handlers (e.g., XML, HTML, RTF,
email), which translate the document content and the
formatting information into GATE’s shared data model, in
a similar way to the conversion layer in our web services
architecture.
GATE’s graphical development environment also
enables the user to create and store GATE applications, so
that GATE can load and configure all modules
automatically at subsequent executions. Users choose
which processing resources go into their application (e.g.
tokeniser, POS tagger), in what order they will be
executed, and on which data (e.g. document or corpus).
For example, ANNIE is a stored GATE application which
when selected for loading, automatically loads and
configures all its components. The GATE Web services
API uses GATE applications to allow users to turn their
applications automatically into web services, by providing
their stored application to the server.
The effort involved in using the Axis toolkit to create
the GATE web services API encapsulating ANNIE was
considerably simpler than with using JWSDP on its own.
A GATE web service interface was defined with various
methods corresponding to ANNIE’s methods. Stub
implementations were provided for all ANNIE methods,
and a WSDL file was generated automatically together
with an Apache Axis configuration file. After the server
application was built, a corresponding client application
was built into GATE to consume web service calls
provided by other GATE implementations, effectively
turning GATE into both client and server simultaneously.
Conclusion
The experience gained from applying theory in practice
shows that although web services still need to improve in
certain areas, they can provide useful results in a short
time, with two days of actual development time being
needed in both prototype experiments. The two prototype
experiments also provided useful insights as to how both
traditional data-oriented (MPI) and processing-oriented
(GATE) Language Resources can be encapsulated
seamlessly along with data resources to form a truly
accessible Language Resource.
The low learning curve and minimal costs involved in
integrating these technologies into existing projects makes
the proposed system highly attractive for small and
medium sized projects that have limited available
resources.
References
Bertagna, F. Calzolari, N. Lenci, A. Zampolli, A. (2001).
The Multilingual ISLE Lexicon Entry (MILE). ISLE
Computational Lexicons Working Group Report, Italy.
Box, D. et al. (2000). Simple Object Access Protocol 1.1.
W3C Note. http://www.w3.org/TR/SOAP
Christensen, E. et al. (2001). Web Services Description
Language 1.1. W3C Note. http://www.w3.org/TR/wsdl
Conner, M. (2001). Web Services: The next horizon for ebusiness, International Business Machines Corporation,
Executive Presentation, Armonk, New York.
Cunningham, H. (1999). A Definition and Short History
of Language Engineering. Journal of Natural Language
Engineering, Cambridge University Press, 5:1-16.
Cunningham, H. Maynard, D. Bontcheva, K. Tablan, V.
(2002). GATE: A Framework and Graphical
Development Environment for Robust NLP Tools and
Applications. In Proceedings 40th Anniversary Meeting
of the Association for Computational Linguistics (ACL
2002). Budapest.
Dalli, A. (2001). Interoperable Extensible Linguistic
Databases , In Proceedings IRCS Workshop on
Linguistic Databases , University of Pennsylvania,
Philadelphia.
Dalli, A. (2002). Creation and Evaluation of Extensible
Language Resources for Maltese, In Proceedings 3rd
International Conference on Language Resources and
Evaluation (LREC) 2002, Las Palmas de Gran Canaria,
Spain.
Expert Advisory Group on Language Engineering
Standards (EAGLES). (2000). Corpus Encoding
Standard for XML. Vassar College, New York. Equipe
Langue et Dialogue LORIA/CNRS, France.
Gates, B. (2003). Keynote speech at Microsoft
Professional Developers Conference 2003 (PDC), 27
October 2003, Los Angeles, California.
Google Inc. (2002). Google Web API, Technical
Documentation,
Mountain
View,
California.
http://www.google.com/apis
Klyne, G. Carroll, J. McBride, B. (2003). Resource
Description Framework (RDF): Concepts and Abstract
Syntax. http://www.w3.org/TR/rdf-concepts
Lassila, O. Swick, R. (1999). Resource Description
Framework (RDF) Model and Syntax Specification.
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222
Maynard, D. Tablan, V. Cunningham, H. Ursu, C.
Saggion, H. Bontcheva, K. Wilks, Y. (2002).
Architectural elements of language engineering
robustness. Journal of Natural Language Engineering,
Special Issue on Robust Methods in Analysis of Natural
Language Data, 8(2/3):257–274.
McCormick, S. (2002). Open Lexicon Interchange
Format. OLIF Consortium. http://www.olif.net
Narsu, U. Murphy, P. (2002). Web Services adoption
outlook improves. Giga Information Group, Report
RPA-042002-00011.
Network Development & MARC Standards Office
(NDMSO). (2002). Encoded Archival Description,
Library of Congress, Washington D.C.
Open Archives Initiative (OAI). (2001). The Open
Archives Initiative Protocol for Metadata Harvesting.
http://www.openarchives.org
UDDI Consortium. (2001). UDDI Version 2.0 Data
Structure Reference. http://www.uddi.org/pubs
Wilks, Y. Gaizauskas, R. Cunningham, H. (1998). GATE:
General Architecture for Text Engineering. University
of Sheffield, Sheffield.
Wittenburg, P. Broeder, D. Sloman, B. (2000).
Documentation of Languages and Archiving of
Language Data at the Max Planck Insitute for
Psycholinguistics
in
Nijmegen.
Presented
at
"Ringvorlesung Bedrohte Sprachen" Sprachenwert –
Dokumentation, Revitalisierung. Fakult?t fur Linguistik
und Literaturwissenschaft, Universit? t Bielefeld
Zampolli, A. (2000). Extensions of PAROLE & SIMPLE
resources: National Projects. SIMPLE: From
Monolingual to Multilingual Resources Workshop,
Athens.
368