Web Service Gateway - A Step Forward To E-Business: Hoang Pham Huy Takahiro KAWAMURA Tetsuo HASEGAWA


Web Service Gateway - a step forward to e-business

Hoang PHAM HUY
Ph.D
Toshiba R&D Center
[email protected]

Takahiro KAWAMURA
Ph.D
Toshiba R&D Center
[email protected]

Tetsuo HASEGAWA
Research Scientist
Toshiba R&D Center
[email protected]

Proceedings of the IEEE International Conference on Web Services (ICWS'04)
0-7695-2167-3/04 $20.00 IEEE

Abstract
Business-to-Business will be a considerable market in the near
future of Internet e-business. In this future market, several
providers need to be able to integrate or exchange information in
order to provide a global service. The problem we tackle in this
paper concerns the existing information sources in the current
Internet environment: how can existing Web sites be integrated with
each other to form a new Internet service? The difficulty has a
historical origin. Internet Web sites were developed for browsing by
human users, so they support neither machine-understandable content
nor inter-provider interaction. To close this gap, we need a
framework that systematically migrates the existing
presentation-oriented Web sites to service-oriented ones. Evidently,
redeveloping all of them is not an acceptable solution.

In this paper, we propose a Web Service gateway mechanism in which
existing Web sites are wrapped by Web Service wrappers. Without any
effort to duplicate the Web sites' code, these services inherit all
features of the sites and can be enriched with other Web Service
features such as UDDI publishing and semantic description. As a
consequence, they can easily be integrated with each other in a
Business-to-Business schema to provide a more valuable service to
users.

This Web Service gateway was developed at Toshiba together with a
Web Service Generator, which generates Web Service wrappers
automatically. Using this system, several real Web services were
generated and made available for use. The Web Service gateway and
these services are also presented and evaluated in this paper.

Keywords: Web Service, Service Gateway, Wrapper, Parser, CGI, WSDL.

1. Introduction
The success of the Internet not only allows computers and business
partners to be connected world-wide but also opens a new way to
carry out business transactions. Supply and commerce over the
Internet, such as online weather forecasts ([9]) or online book
shops ([10][11]), have already entered the market as individual
information sources. Broadly speaking, almost all of these
information sources share two principal characteristics: they are
Business-to-Customer oriented and they are not open. The latter can
be considered a consequence of the former, since supporting a
Business-to-Customer schema does not require an information source
to be open to others. However, Business-to-Business is bound to be a
considerable market in future e-business, in which each provider
needs to share its capabilities with others. This future market also
requires an infrastructure that integrates several providers into a
global service and supports service-independent features such as
accounting, billing and security. Evidently, Web Service technology
is the most promising candidate. The question we tackle in this
paper concerns the existing information sources in the current
Internet: how can they be harmonized with the future
Business-to-Business market?
A common way to import existing information sources into a new
market is to install wrapper components that act as representatives
of the old providers. Although this is not a new research domain,
adopting the wrapper mechanism under the Web Service strategy turns
it into a promising solution. Thanks to the current efforts in
developing and standardizing Web Service technology, our Web Service
gateway brings all the advanced features of Web Services to existing
Internet information sources without much effort spent on
redeveloping them. Keeping in mind that more than 80% of the current
information sources on the Internet are realized as dynamic Web
sites with an underlying database ([1]), we concentrate on
supporting this kind of Web site. Moreover, since the Common Gateway
Interface (CGI) is the mechanism most used in dynamic Web sites, our
gateway was designed to fully support CGI. Other similar mechanisms
such as Active Server Pages (ASP) and Java Server Pages (JSP) can
also be supported just by adding the necessary library to the
gateway. Figure 1 shows the general components of our Web Service
gateway and the way existing Internet information sources are
imported into the Web Service domain. Because Internet information
sources differ from each other in the way they are accessed
(protocol, query structure, etc.) and in the format of the returned
HTML pages, a different wrapper is required to represent each
information source. These wrappers are generated with the help of
the Wrapper Generator System and automatically deployed to the
gateway. They take the responsibility to
- receive requests from service clients in the Web Service domain;
- convert each request to an appropriate form before sending it to
  the existing information source;
- get the returned HTML pages and extract the required concrete
  data;
- return the data to the service client in the Web Service domain.
These tasks are carried out with the appropriate support of the
Generic Wrapper System available in the gateway. In order to support
different technologies in existing Internet sources as well as in
the Web Service domain, a plug-in design strategy was adopted in our
Web Service gateway. For instance, if we need to support a new kind
of information source, a new plug-in can be added to the Generic
Wrapper System to provide the necessary library for new wrappers.

Figure 1: Toshiba Web Service Gateway
(Diagram labels: Current Internet Environment with Info. source 1,
Info. source 2 and Info. source 3; Gather info. of individual
sources; Wrapper Generator; Generate wrappers & deploy to gateway;
Web Service Gateway containing several Wrappers on the Generic
Wrapper System; Web Service.)

In the next section, we briefly describe several related works,
pointing out the aspects related to our approach. Sections 3 and 4
then describe in detail the architecture and implementation of our
Web Service gateway. Section 5 presents a demonstration of using our
Web Service gateway to create an e-business service from several
existing Internet information sources. The last section concludes
our work and outlines several future works.

2. Related Work and Our Approach
The wrapper idea has been considered for several years. The basic
mechanism is the same, but in each period the applied methods
differed according to the technology trend of the time. In
Stanford's TSIMMIS project in 1997 ([4]), wrappers were used to
provide a universal access mechanism for heterogeneous information
sources including databases, Web sites, etc. The key point of
TSIMMIS is not creating the wrapper itself but customizing the
wrapper's interface, based on the concept of a wrapper-template. For
that, TSIMMIS provides several hard-coded wrappers for different
information sources. When used in a particular application, these
wrappers can be customized with newly defined wrapper-templates in
order to provide an appropriate interface for the application. The
weakness of this approach is that each information source requires a
new hard-coded wrapper; moreover, wrapper generation was completely
lacking in TSIMMIS. However, the idea of the wrapper-template is a
strong point, and it is inherited in our approach to generate
wrappers automatically, based on definitions given by users.

Jedi ([6]) is another project, organized in Germany some years after
TSIMMIS. With the objective of mostly supporting Web site
information sources, Jedi concentrates on the procedure of parsing
and extracting data from an HTML page. For that, an Extraction
Language was defined to describe the pattern of data in an HTML
page; a Jedi parser (provided as several Java libraries) can then be
used to extract the data from the patterned HTML page. Like TSIMMIS,
Jedi did not consider wrapper generation. It just provides a
mechanism and several Java libraries to create an HTML parser and
extract concrete data from an HTML page. Thanks to its plug-in
design, our Web Service gateway can apply the Jedi mechanism to
create an Intelligent HTML Parser, as presented in section 3.2.

Automatic wrapper generation was considered in UMICAS ([3]). The
Qualified-path-expression Extractor Language (QEL) and the Complex
Extractor Specification Language (CESL) were used to define the
queries that wrappers should use to extract data from an HTML page.
Users can examine an HTML page and use a GUI toolkit to define
queries in these two languages. A wrapper can then be generated
automatically with just one mouse click in the GUI toolkit. Although
this project covered the Web site wrapper domain quite completely,
standardization is its weak point. QEL and CESL were not widely
accepted as a standard method for extracting data from HTML pages.
The generated wrappers also did not provide a standard interface for
receiving clients' requests. Nowadays, the XPath ([12]), XQuery
([13]) and DOM ([14]) specifications are widely accepted in place of
QEL and CESL. For the wrapper's interface to receive requests, the
Web Service Description Language (WSDL) is a world-wide accepted
specification. That is why they are taken into account in our
approach.

Semi-automatic wrapper generation ([2], [7], [8]) can be considered
the most advanced wrapper generation mechanism at present. With the
help of a GUI, the user can analyze an HTML page and define several
data extraction rules. The rest of the wrapper generation procedure
is delegated to a toolkit. This semi-automatic way can support
complicated tasks in locating data in an HTML page with help from
users, while releasing them from several common, uninteresting
routines. However, these works all define their own descriptive
language for the data extraction rules, which is, we think, not an
intelligent approach. This strategy implicitly excludes the
possibility of applying other mechanisms to extract data from HTML
pages. Keeping this observation in mind, our approach also follows
the semi-automatic wrapper generation mechanism but tries to define
a common framework for Intelligent HTML Parsers, which are the
plug-ins of our Web Service gateway. This framework provides a
common interface that wrappers can use to extract data from HTML
pages. On the other hand, each plug-in can be implemented with a
different data extraction mechanism, based on a different
descriptive language for describing the pattern of data in HTML
pages. Users can use our plug-in discovery tool to find out which
plug-in parser is the most appropriate for the HTML page pattern of
a particular Web information source, and then bind this plug-in to
the corresponding wrapper. The wrapper description itself is not
changed. This is the "intelligence" aspect of our HTML parser
framework. Concerning the characterization of the most appropriate
parser, a wrapper verification concept was also proposed in [5].

3. WS Gateway and WS Wrapper
As generally presented in figure 1, our Web Service gateway consists
of two principal parts: the Web Service Wrappers and the Generic
Wrapper System. These components are discussed in the following
sub-sections.

3.1. Web Service Wrapper
The logical design of our Web Service wrappers can be seen in
figure 2.

Figure 2: Web Service Wrapper
(Diagram labels: Web Service Domain; WS Wrapper with Web Service
Interface (WSDL), Coordinator, Info. Source Access Protocols and
Data Extractor; Parser Plugin; Wrapper Description; Internet Info.
Source.)

Each wrapper consists of 4 basic modules:

- Web Service Interface. This interface is defined with WSDL and
  provides an entry point for accessing the wrapper. Through this
  interface, other entities in the Web Service domain see and make
  use of the wrapper as a normal Web Service. Evidently, other
  mechanisms supporting Business-to-Business in the future Web
  Service domain, such as UDDI, ontology-based service seeking,
  etc., can be applied to these Web Service wrappers.
- Wrapper Coordinator. This module takes the responsibility to
  transform the requests received through the Web Service interface
  into another form that the Internet information source can
  understand. For example, when wrapping CGI Web sites, the Wrapper
  Coordinator module converts the parameter values of the Web
  Service interface into a CGI query conforming to the CGI Web site.
- Information Source Access Protocol. This module provides the
  engineering feature of transporting the requests, after they have
  been harmonized by the Coordinator, to the information source. It
  also takes the responsibility to retrieve the HTML pages that the
  information source sends in reply. Depending on the technique that
  the Internet information source uses, the corresponding access
  protocol is applied in this module, driven by the Coordinator: for
  example CGI GET, CGI POST, ASP, JSP, etc.
- Data Extractor. This module takes the responsibility to extract
  data from HTML pages and construct an appropriate data structure.
  This data structure is then returned to the requesting entity in
  the Web Service domain through the Web Service Interface. Instead
  of hard-coding an HTML data extraction mechanism, this module
  communicates with several independent HTML parsers (the Parser
  plug-ins of the Web Service gateway) to carry out its task.

The key point of our approach is that the wrappers are not
hard-coded in the Web Service gateway; they are generated
automatically by the Wrapper Generator tool, based on a wrapper
description. This description contains all necessary information
about the Internet information source (access protocol, request
format, etc.), the signature of the Web Service interface, and the
mapping between these two items. By designing wrappers as several
independent basic modules and supporting module generation, our Web
Service gateway can easily be adapted to a new business environment
or to different kinds of information source. For example, to provide
a gateway interface for the J2EE Enterprise Java Bean environment,
the wrapper description files can be changed and a new wrapper class
can be generated with an EJB interface in place of the Web Service
Interface.

3.2. Intelligent HTML Parser Plug-in
As described above, data extraction from HTML pages is carried out
by several intelligent HTML parsers. These parsers are deployed in
the Generic Wrapper System of the Web Service gateway as plug-in
entities. With this design strategy, any third party can develop its
own HTML parser, based on its own mechanism, and deploy it to our
Web Service gateway to make it available to the wrappers. For that,
we need to define a standard framework that all HTML parser
implementations must follow to be able to execute in our gateway.
The following two mandatory requirements are strongly considered in
our framework design:
- The algorithm for extracting data from an HTML page and the way to
  implement it should be as flexible as possible, so that third
  parties can decide as freely as possible how to develop their
  intelligent HTML parser.
- One standard communication mechanism must be designed, which all
  HTML parsers must support in order to communicate with the Web
  Service wrappers.
For the first requirement, since all HTML pages are constructed from
tree-style HTML tags, we propose to abstract every HTML page format
to some kind of tree-style nodes. Thus, the path to locate data in
an HTML page can be abstracted to:

root.node(1).node(2)...node(n)

Figure 3: Parsing a HTML page by flexible tree-style
(Diagram labels: one HTML page, structured in one tree-style and
structured in another tree-style; in each tree, the path from root
to data node runs from the root through intermediate nodes to the
data node.)


We call this a path-finder sentence; it consists of several
node-finder elements. Each node-finder element (node(1), node(2),
etc.) can be given as an explicit node address, such as the root
node, or as an address relative to the parent node. The intelligence
of this approach is that, although the abstract HTML page format is
fixed to a tree style, the realization of these trees depends
completely on the implementation. So does the node-finder. Depending
on the parser implementation mechanism, a node-finder can be an
explicit digit position like 1.0.3.5 or a textual position
expression like "second DIV tag" or "first DIV tag after 1.0.3.5".
Figure 3 shows that the same HTML page can be represented by two
different tree-style structures, with two different path-finder
sentences locating the same data.

Thus, the procedure of extracting data from an HTML page becomes
that of finding the path-finder sentence. In more detail, it is a
sequence of information exchanges between the wrapper's Data
Extractor module and an appropriate HTML parser plug-in to, firstly,
construct a tree-style structure and, secondly, set up the next
node-finder, and so on. Evidently, the information that the
wrapper's Data Extractor module sends to the HTML parser to
establish the path-finder sentence must conform to the parser
implementation. This information is given to the wrappers by their
wrapper description when they are generated and deployed to the Web
Service gateway.
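
To make this exchange concrete, the following is a minimal sketch of
how a parser plug-in could resolve a path-finder sentence one
node-finder at a time. It is an illustration under stated
assumptions, not the gateway's implementation: the names
SimpleParserPlugin, get_node_position, get_data and extract, the
sample page, and the two accepted node-finder formats (a child index
such as "1", or a three-word expression such as "second div tag")
are all hypothetical, and the input is assumed to be well-formed
HTML without void elements such as <br>.

```python
from html.parser import HTMLParser

ORDINALS = {"first": 1, "second": 2, "third": 3}


class Node:
    """One node of the tree-style structure built from the page."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree (assumes every tag is closed)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


class SimpleParserPlugin:
    """Hypothetical plug-in: positions are dotted child indices, root is "0"."""

    def __init__(self, html):
        builder = TreeBuilder()
        builder.feed(html)
        builder.close()
        self._root = builder.root

    def _node_at(self, position):
        node = self._root
        for index in position.split(".")[1:]:
            node = node.children[int(index)]
        return node

    def get_node_position(self, parent_position, node_finder):
        """Resolve one node-finder relative to the parent position."""
        parent = self._node_at(parent_position)
        if node_finder.isdigit():
            index = int(node_finder)  # explicit child index
        else:
            # textual form: "<ordinal> <tag> tag", e.g. "second div tag"
            ordinal, tag, _ = node_finder.lower().split()
            matches = [i for i, c in enumerate(parent.children) if c.tag == tag]
            index = matches[ORDINALS[ordinal] - 1]
        if not 0 <= index < len(parent.children):
            raise LookupError(f"no child {node_finder!r} under {parent_position}")
        return f"{parent_position}.{index}"

    def get_data(self, position):
        return self._node_at(position).text


def extract(plugin, path_finder):
    """Walk the path-finder sentence from the root and return the data."""
    position = "0"
    for node_finder in path_finder:
        position = plugin.get_node_position(position, node_finder)
    return plugin.get_data(position)


PAGE = ("<html><body><div>Header</div>"
        "<div><span>Tokyo</span><span>18 C</span></div></body></html>")

if __name__ == "__main__":
    plugin = SimpleParserPlugin(PAGE)
    print(extract(plugin, ["0", "0", "1", "1"]))
    print(extract(plugin, ["first html tag", "first body tag",
                           "second div tag", "second span tag"]))
```

Both sentences locate the same data node ("18 C" in the sample
page), which mirrors the point that the node-finder format is a
property of the plug-in rather than of the wrapper.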
Figure 4 presents the standard communication framework between the
wrapper's Data Extractor module and an HTML parser plug-in for
extracting data. After receiving an HTML page from the information
source, the wrapper's Data Extractor module contacts the Web Service
gateway to initialize an appropriate HTML parser. After initializing
the HTML parser, the Web Service gateway returns the parser
reference to the Web Service wrapper. By invoking several
standardized operations on this parser reference, the wrapper
requests the parser to initialize the tree-style nodes representing
the current HTML page. Then, the wrapper invokes the parser one or
several times to send the node-finder sentences, as defined in the
wrapper description, to find the data node. As explained before, the
form of the node-finder depends on the parser implementation; it can
be, for instance, "go to the first image node" or "go to the second
child node". Finally, the wrapper requests the HTML parser to send
back the data in the current data node.

Figure 4: Standard Communication Framework
(Sequence between WS Wrapper, WS Gateway and HTML parser: the
wrapper receives the returned HTML page from the info. source;
requests contact with an HTML parser; the gateway initializes an
HTML parser and returns the contact point; initialize tree nodes,
answered with the root node address; next node-finder sentence,
answered with the next node address; get data attached to the
current node, answered with the data.)

Based on the above design strategy, we define a set of operations
that any intelligent HTML parser must provide to be invoked by the
wrappers. The most important plug-in function is getNodePosition():

getNodePosition(parentPosition, node-finder)

This function calculates and returns the position of the current
node in the tree, based on the parent node and the relative distance
to the given node. The position format of these nodes (parent,
current and node-finder) depends completely on the plug-in
implementation. For example, one plug-in may accept an explicit
tree-based digit position like 1.0.3.5, while another may accept
both this explicit digit position and a textual position expression
like "second DIV tag" or "first DIV tag after 1.0.3.5". XPath,
XPointer or XQuery can be applied here to define the node-finder
sentence. Thus, the more human-like the position format that a
plug-in HTML parser accepts, the more intelligent the parser is in
extracting data.

4. Wrapper Generator Toolkit
In order to help users define the wrapper description and generate
the wrapper code, a Web Service wrapper generator toolkit was
developed at Toshiba R&D Center. This toolkit consists of 2 main
components: Wrapper Generator and Plug-in Discovery.

Figure 5 presents the main window of our Web Service Wrapper
Generator (WSGen). This tool provides a GUI for analyzing an
information source and defining the wrapper description. Then, with
just one mouse click, the wrapper code is generated and deployed to
the Web Service gateway. In the first version, we concentrate on
Internet information sources based on CGI technology. For these,
when the user enters a normal CGI query, including the protocol, the
CGI Web site address, the resource path and the query pattern, WSGen
automatically generates a primitive Web service for the given CGI
Web site. The mapping between the Web Service interface and CGI is
carried out in several steps, in several working tabs. In each tab,
the user can graphically create, modify, delete, etc. the elements
of the wrapper description. After generating a Web Service wrapper,
the user can go to the Deploy working tab of WSGen to deploy it to
the Web Service gateway. Toshiba provides a default Web Service
gateway that is delivered with WSGen. Otherwise, the user can take
the generated Web Service gateway (packed in a WAR file, Web
Application Archive) and deploy it in any other Web Service hosting
system. This WAR file is packed with all necessary libraries, such
as access protocols, HTML parser plug-ins, etc., so it can be
deployed to any Web Service hosting server that supports the WAR
specification.

Figure 6 presents the main window of our Plug-in Discovery. This
tool allows the user to try all the HTML parsers available in the
Web Service gateway to find the one most suitable for parsing the
HTML pages returned from a given information source. This tool also
allows the user to define the node-finder sentences. After an
available HTML parser is selected, the node tree is shown to the
user. When the user clicks on a node in the HTML page, the discovery
engine analyzes the data attached to this node and proposes all the
possible node-finder sentences for reaching this node.
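
The proposal step of the discovery engine can be sketched as
follows. This is only an illustration of the idea, assuming a
well-formed (XHTML-like) page that Python's xml.etree.ElementTree
can parse; the function name propose_sentences, the three output
formats and the sample page are hypothetical and are not taken from
the Plug-in Discovery tool itself.

```python
import xml.etree.ElementTree as ET

ORDINALS = ["first", "second", "third", "fourth", "fifth"]


def propose_sentences(root, target):
    """Return candidate path-finder sentences leading from root to target."""
    # ElementTree keeps no parent links, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    # Collect (parent, node) pairs from the target up to the root.
    chain = []
    node = target
    while node is not root:
        parent = parents[node]
        chain.append((parent, node))
        node = parent
    chain.reverse()

    digits, textual, xpath = [], [], []
    for parent, node in chain:
        siblings = list(parent)
        # Format 1: explicit digit position among all children.
        digits.append(str(siblings.index(node)))
        # Formats 2 and 3: rank among the siblings that share the tag.
        same_tag = [s for s in siblings if s.tag == node.tag]
        rank = same_tag.index(node)
        word = ORDINALS[rank] if rank < len(ORDINALS) else f"{rank + 1}th"
        textual.append(f"{word} {node.tag} tag")
        xpath.append(f"{node.tag}[{rank + 1}]")

    return {
        "digit": ".".join(digits),
        "textual": textual,
        "xpath": "./" + "/".join(xpath),
    }


PAGE = ("<html><body><div>Header</div>"
        "<div><span>Tokyo</span><span>18 C</span></div></body></html>")

if __name__ == "__main__":
    root = ET.fromstring(PAGE)
    clicked = root[0][1][1]  # the node the user clicked: second span
    for name, sentence in propose_sentences(root, clicked).items():
        print(name, sentence)
```

Each proposed sentence addresses the same clicked node in a
different position format, so the user can pick the one that matches
the parser plug-in bound to the wrapper.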

Figure 5: WS Wrapper Generator main windows
(Screenshot labels: WS Generator Tools; Working tab name; Current
Working tab; Working buttons.)

Figure 6: Plug-in Discovery Tool
(Screenshot labels: Original Operation URL; Working zone; Document
structure zone; Document preview zone.)
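
The items that WSGen collects for a CGI source (protocol, Web site
address, resource path and query pattern), together with the mapping
from Web Service parameters to CGI parameters, can be pictured as
plain data from which the Wrapper Coordinator builds the CGI
request. The sketch below is a hedged illustration only: the
dictionary layout, the field names, the operation name searchBooks,
the host books.example.com and the function build_cgi_request are
assumptions made for this example and do not reflect the actual
wrapper description format.

```python
from urllib.parse import urlencode, urlunsplit

# Hypothetical wrapper description for one CGI information source.
WRAPPER_DESCRIPTION = {
    "operation": "searchBooks",          # assumed Web Service operation name
    "source": {
        "protocol": "http",
        "address": "books.example.com",  # placeholder host
        "resource_path": "/cgi-bin/search",
        "method": "GET",                 # CGI GET or CGI POST
    },
    # Web Service parameter name -> CGI query parameter name
    "parameter_mapping": {"title": "q", "maxResults": "n"},
    # CGI parameters the site expects but the service does not expose
    "fixed_parameters": {"lang": "en"},
}


def build_cgi_request(description, ws_arguments):
    """Convert Web Service arguments into (url, body) for the CGI site."""
    mapping = description["parameter_mapping"]
    unknown = set(ws_arguments) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped parameters: {sorted(unknown)}")

    # Start from the fixed parameters, then add the mapped arguments.
    query = dict(description["fixed_parameters"])
    for ws_name, value in ws_arguments.items():
        query[mapping[ws_name]] = value
    encoded = urlencode(query)

    source = description["source"]
    base = (source["protocol"], source["address"], source["resource_path"])
    if source["method"] == "GET":
        # GET: parameters travel in the query string, no body.
        return urlunsplit(base + (encoded, "")), None
    # POST: parameters travel in the request body.
    return urlunsplit(base + ("", "")), encoded.encode("ascii")


if __name__ == "__main__":
    url, body = build_cgi_request(
        WRAPPER_DESCRIPTION, {"title": "web services", "maxResults": 5}
    )
    print(url)
    print(body)
```

Keeping the mapping as data rather than code is what would allow a
generator to emit a new wrapper for a different source by changing
only the description.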


5. Applying Web Service Gateway
Using Toshiba WSGen, several wrappers were generated for CGI Web
sites such as Yahoo, Amazon, etc. and deployed to the Toshiba Web
Service gateway. This gateway provides Web Service interfaces inside
our group in order to test the feasibility of several Web Services
in a future business environment. In this section, we introduce one
of our applications based on this system: the Patent Looking &
Translating (PLT) service. PLT allows users to look for patents and
related documents and then translate them into a requested language.
This service, developed in our group at Toshiba, was demonstrated at
the exhibition of Toshiba R&D Center in December 2003. The principal
architecture of the PLT service is displayed in the following
figure. Agent technology is used as an enabling technology for
integrating different wrappers.

Figure 7: PLT Service Architecture

Currently, there are a lot of CGI Web sites providing online
document and patent search. ITPapers (www.itpapers.com), NEC
CiteSeer (www.citeseer.nec.com), US Patent (www.uspto.gov), etc. are
examples. Online translation services such as Altavista Translation
(www.babelfish.altavista.com) are also quite popular on the
Internet. The PLT Service was created from these Web sites with our
Web Service Gateway and Bee-gent, a Mobile Agent platform also
developed in our group. The PLT Service allows users to look for
patents and related documents by entering several keywords and a
requested language. An agent is created and migrates to several
hosts to search for the appropriate documents. It then brings all
the documents that it found to a translation host and translates
them into the requested language.

6. Conclusion and Future Works
The PLT Service is not the first searching and/or translating
service on the Internet. There are also several Web sites providing
this kind of service. However, these sites normally provide only
either the searching or the translating feature, not both. By
applying our Web Service gateway, we provide a standard mechanism
for Business-to-Business with existing Internet resources. This Web
Service gateway product has been completed in our laboratory and has
started to be distributed in the market.

As for future works, several directions are envisaged. The first is
integrating the Web Service Gateway with MatchMaker ([13]), a Web
service matchmaking system supporting semantic description. It will
allow wrappers, once generated, to be published automatically with a
semantic description. The second direction is integrating the Web
Service Gateway with Bee-gent, an agent platform developed at
Toshiba R&D Center. Another future work is to develop new
intelligent HTML parsers allowing wrappers to perform more
human-like actions.


7. References
[1] A. Sahuguet and F. Azavant, "Building Light-Weight Wrappers for
    Legacy Web Data-Sources Using W4F", Proceedings of 25th
    International Conference on VLDB, 1999, pp. 738-741.
[2] R. Baumgartner, S. Flesca, and G. Gottlob, "Visual Web
    Information Extraction with Lixto", Proceeding of 27th
    Conference on VLDB, 2001, pp. 119-128.
[3] J. Gruser, L. Raschid, M. Vidal, and L. Bright, "Wrapper
    Generation for Web Accessible Data Sources", Proceeding of
    Conference on Cooperative Information Systems, 1998, pp. 14-23.
[4] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni,
    M. Breunig, and V. Vassalos, "Template-Based Wrappers in the
    TSIMMIS System", Proceedings of 23rd ACM SIGMOD Conference on
    Management of Data, 1997.
[5] K. Nicholas, "Wrapper verification", World Wide Web, Kluwer
    Academic Publishers, Volume 3, Issue 2, 2000, pp. 79-94.
[6] G. Huck, P. Frankenhauser, K. Aberer, and E. Neuhold, "Jedi:
    Extracting and Synthesizing Information from Web", Proceeding of
    3rd Conference on Cooperative Information Systems, 1998,
    pp. 32-43.
[7] L. Lui, C. Pu, and W. Han, "An XML-Enabled Wrapper Construction
    System for Web Information Sources", Proceeding of 15th
    Conference on Data Engineering (ICDE), 2000, pp. 611-621.
[8] M. Christoffel, B. Schmitt, and J. Schneider, "Semi-Automatic
    Wrapper Generation and Adaption", Proceeding of Conference on
    Enterprise Information Systems, 2002.
[9] AccuWeather: http://www.accuweather.com
[10] Amazon book shop: http://www.amazon.com
[11] Isbn.nu book shop: http://www.isbn.nu
[12] XPath Specification: http://www.w3c.com/TR/xpath
[13] XQuery Specification: http://www.w3c.org/XML/Query
[14] W3C Document Object Model (DOM): http://www.w3c.com/DOM
