Architecture of Deep Web: Surfacing Hidden Value
Suneet Kumar, Virender Kumar Sharma
International Journal of Computer Information Systems, Vol. 3, No. 3, 2011
I. INTRODUCTION

The importance of the Deep Web (DW) has grown substantially in recent years, not only because of its size [12], but also because Deep Web sources arguably contain the most valuable data compared to the so-called Surface Web [4]. An overlap analysis between pairs of search engines conducted

Table 1. Characteristics of data integration (DI), information extraction from the Web (IEW) and Deep Web data integration (DWI)

                                                      DI    IEW    DWI
Flexible data access layer                            +     -/+    -
Query rewriting based on query capabilities           +     -/+    +
Extensibility/modular architecture                    +     +      +
Dynamic challenges in data presentation technology          +      +
In this paper we describe a classification framework that allows different approaches to be compared on the basis of the full model of the data extraction and integration process. We also propose a refinement of the architecture models to cover the complete data extraction and integration process for Web sources.

II. DEEP WEB DATA EXTRACTION AND INTEGRATION PROCESS
In the integration process two continuously executed phases can be distinguished: system preparation, the creation and continuous update of the resources required for using the system, and system usage, the actual lazy or eager integration of data from dispersed sources.
Figure 1. The Deep Web data extraction and integration process: system preparation resources (list of Web sources, content descriptions) and system usage steps (select Web sources, integrate schemas, execute query, consolidate data), with the associated data and control flows.
During the system preparation phase, the sources first need to be discovered. This step is currently assumed to be done by human beings: the list of sources is usually provided to the integration system. Subsequently the sources need to be described: their schemas, query capabilities, quality and coverage have to be provided for further steps in the integration process. Most importantly, the schema descriptions are used in the next step to map them either to one another or to some global schema. The last part of system preparation is the definition of navigation and extraction rules that allow the system to get to the data through a source's Web interface.

These resources are used once a query needs to be answered: the sources to ask are selected based on their subject and coverage, a plan is generated using the schema mappings and source query capabilities, and the query is executed through the Web interface using the navigation model and data extraction rules defined for particular queries. Once data are extracted from HTML pages, they need to be translated and cleaned according to data translation rules. The overall process includes all the activities necessary to integrate data from Web databases identified so far in the literature, and adds a few steps (discovery of Web sources, declarative description of the navigation model) that have been neglected so far. As such it can serve as a basis for a generic data integration system.
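To make the flow above concrete, the following is a minimal Python sketch of the two-phase process; the SourceDescription structure, answer_query and execute_wrapper are hypothetical names introduced here for illustration, and wrapper execution against the Web interface is stubbed out.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    """Resources produced during the system preparation phase (names are illustrative)."""
    name: str
    subjects: set[str]                      # content / coverage description
    schema_mapping: dict[str, str]          # source attribute -> global attribute
    query_capabilities: set[str]            # attributes the source can be queried on
    navigation_model: list[str]             # declarative steps through the Web interface
    extraction_rules: dict[str, str] = field(default_factory=dict)


def execute_wrapper(source: SourceDescription, constraints: dict[str, str]) -> list[dict]:
    # Placeholder for executing the navigation model and extraction rules
    # against the source's Web interface.
    return []


def answer_query(subject: str, constraints: dict[str, str],
                 sources: list[SourceDescription]) -> list[dict]:
    """System usage phase: select sources, query them through wrappers, consolidate."""
    results: list[dict] = []
    for source in sources:
        # 1. Source selection based on subject and coverage.
        if subject not in source.subjects:
            continue
        # 2. Query rewriting restricted to the source's query capabilities.
        supported = {k: v for k, v in constraints.items() if k in source.query_capabilities}
        # 3. Wrapper execution, then translation of extracted records to the global schema.
        for record in execute_wrapper(source, supported):
            results.append({source.schema_mapping.get(k, k): v for k, v in record.items()})
    return results
```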
III. OPERATIONAL SOURCE SYSTEMS
These are the operational systems of record that capture the transactions of the business. The source systems should be thought of as outside the data warehouse, because there can be little or no control over the content and format of the data in these operational legacy systems. The operational systems contain little or no historical data, and if we have a data warehouse, the source systems can be relieved of much of the responsibility for representing the past. Each source system is often a natural stovepipe application, where little investment has been made in sharing common data such as product, customer, geography or calendar with other operational systems in the organization.

IV. CLASSIFICATION OF DATA EXTRACTION AND INTEGRATION APPLICATIONS
We identified 13 systems most prominently referred to in the subject literature (AURORA [17], DIASPORA [19], Protoplasm [15], MIKS [2], TSIMMIS [5], MOMIS [1,3], GARLIC [18], SIMS [6], Information Manifold [11], Infomaster [14], DWDI [13,9], PICSEL [21], Denodo [20]) in order to base our architecture on the approaches reported to date. We used the following set of criteria to evaluate the architectural decisions of existing systems (the set arose as an outcome of the requirements posed by the integration process and the intended generality of the proposed architecture):
- does the system support source discovery,
- can source content / quality be described,
- what are the means for schema description and integration,
- is source navigation present as a declarative model or hard coded,
- how are querying capabilities expressed,
- how are extraction rules defined,
- how are data translation rules supported,
- is query translation available,
- can the system address Deep Web challenges,
- can the system be distributed,
- what is the supported type of information (unstructured / semi-structured / structured).

After studying the literature the following conclusions were drawn (we restrict ourselves to the conclusions only, due to lack of space for the detailed analysis results):
- Almost none of the systems supports source discovery or quality description (only PICSEL supported the latter, with manual editing).
- Schema description usually required manual annotation.
- There was a wide spectrum of approaches to schema integration (from purely manual to fully automatic), with a variety of reported effectiveness for automatic integration. Most noteworthy, older systems tried to approach this issue automatically, while newer ones rely on human work.
- Source navigation was usually hard coded in wrappers for particular sources.
- Extraction and data transformation rules were either hard coded or (rarely) learned by example and represented declaratively.
- Only a few systems (Protoplasm, Denodo, Aurora) were applicable to the Deep Web out of the box. Most of the other systems would require either serious (including underlying model refinement) or minor (technical) modifications.
- Most of the systems were intended for structured (usually relational) information, and few of them could work as distributed systems (MIKS).
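As an illustration only, this classification framework can be thought of as a sparse feature matrix over systems and criteria. The Python sketch below uses hypothetical names and records only the few values stated explicitly in the conclusions above; the rest of the matrix would be filled in from the detailed analysis.

```python
# Criterion names follow the evaluation list above.
CRITERIA = [
    "source discovery",
    "content / quality description",
    "schema description and integration",
    "declarative source navigation",
    "query capabilities",
    "extraction rules",
    "data translation rules",
    "query translation",
    "Deep Web readiness",
    "distribution",
    "supported information type",
]

# Sparse matrix: only values explicitly mentioned in the conclusions are recorded.
evaluation: dict[str, dict[str, str]] = {
    "PICSEL": {"content / quality description": "manual editing"},
    "Denodo": {"Deep Web readiness": "out of the box"},
    "MIKS":   {"distribution": "supported"},
}


def compare(systems: list[str], criterion: str) -> dict[str, str]:
    """Return the recorded value of one criterion for the given systems."""
    return {s: evaluation.get(s, {}).get(criterion, "not assessed") for s in systems}


print(compare(["PICSEL", "Denodo", "MIKS"], "Deep Web readiness"))
```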
V. CONCLUSION

Based on the analysis of the existing systems and the complete Deep Web data integration process presented above, we propose an extensible architecture for a Deep Web data integration system. The factors that motivated us and influenced the architecture design were the following: we strived to achieve maximal generality and extensibility of the architecture and wanted to cover all the steps of the data integration process. We designed the architecture in a way that allows for experimentation with and comparison of the various techniques that can be applied during the data integration process. The list of components in the architecture (see Fig. 2) includes:

Crawler (source discovery module): used in the system preparation phase to discover source candidates. It performs Web form parsing and provides a preliminary description of the sources.

Automatic source description module: another preparation phase module, used to describe a source's schema, content (in a more detailed way than the crawler) and quality. This module is also responsible for providing the navigation model for the source (how to reach the data through the Web interface, once we know what to ask about), which can be obtained by learning and generalization. The last responsibility of this module is to create data extraction rules (by discovery).
Source description editor: this module is used for modifying the list of sources, their schemas and descriptions. Recording, manual specification or editing of the navigation model and data extraction rules is also a functionality of this module.

Automatic schema mapping module: it provides (during the preparation phase) the means for grouping or clustering of sources and for the discovery of schema matchings and data translation rules.

Schema mappings editor: the respective editor for the declarative mappings, clusters and data translation rules provided by the previous module.

Business-level data logic: this module is essentially external to the system; it can be applied once the data is retrieved and integrated to conduct domain-dependent data cleansing, record grouping, ranking and similarity measurement, and even data mining tasks. The logic here is somewhat external to the data integration logic, even though there may be scenarios where parts of the business logic may seem to fit better into the last parts of the data consolidation process.

Data mediator: used in the query phase, it performs a union on data from several sources and executes post-union operations from the query plan. It may also maintain mediator-specific query cost statistics.

Query planner: responsible for source selection (it evaluates, based on the source descriptions, which sources should actually be considered for querying) and performs query optimization based on cost statistics. An important feature of a query planner in data integration (as opposed to traditional database query planners) is query rewriting for specific sources with respect to their query capabilities. The planner has to take the sources' limitations into account and may schedule some operations to be performed after the data is unified.

Wrappers (query execution modules): used to execute the navigation plan for a particular query. There is a one-to-one mapping between the query types supported by a source and the navigation models for those queries, therefore wrappers simply execute navigation plans once given the queries that they should answer. During navigation they perform data extraction and translation. They can also provide source-specific cost statistics.

Many of the features executed during the preparation phase may be performed automatically; however, as visible from the component descriptions, we assume that corresponding editors also exist that allow for post-editing of the results of automated processes to increase the system's quality of service, and that may replace an automatic solution in case automation of some tasks is not possible (missing technology) or not beneficial (too low reliability of the automated process). Some of the functions of the individual components listed above (e.g. schema matching [8, 7]) may be performed by several different approaches. Thus we propose that our architecture include placeholders (interfaces) supporting dynamically loaded plug-ins for several functions of subcomponents. We describe the proposed types of plug-ins below. The modular approach to these components has a direct impact on the proposed architecture's flexibility and extensibility. Another feature of our architecture is the level of parallelization or
distribution: certain components may reside on several servers and be executed in parallel. Three classes of components are defined:
- Single: a single instance of the component exists in the system.
- Multiple: multiple instances of the component may exist (e.g. several parallel crawlers, or several mediators handling queries with workload balancing in mind).
- Multiple per query: multiple instances of the component may exist to execute one query in a distributed manner.
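A minimal sketch of how these three instantiation classes could be encoded; the Instantiation enum and the component-to-class mapping are hypothetical names, with the assignments taken from the per-component description given further below.

```python
from enum import Enum


class Instantiation(Enum):
    SINGLE = "single"                  # one instance in the whole system
    MULTIPLE = "multiple"              # several parallel instances (e.g. crawlers, mediators)
    MULTIPLE_PER_QUERY = "per_query"   # several instances cooperating on a single query


# Assumed mapping of architecture components to their instantiation class,
# following the description of component instances in the text below.
COMPONENT_CLASSES = {
    "crawler": Instantiation.MULTIPLE,
    "source_description_module": Instantiation.MULTIPLE,
    "schema_mapping_module": Instantiation.SINGLE,
    "data_mediator": Instantiation.MULTIPLE,
    "query_planner": Instantiation.MULTIPLE,
    "wrapper": Instantiation.MULTIPLE_PER_QUERY,
}
```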
Figure 2. Components of the proposed architecture: preparation phase components (Crawler, Automatic Source Description Module, Schema Mapping Module, Source Description Editor, Schema Mapping Editor) and usage phase components (Business-level Data Logic, Data Mediator, Query Planner, Wrapper execution modules), together with individual plug-in implementations and the data exchange format and management component.
The Crawler component may have multiple instances and is extensible in the sense that it allows for pluggable source detectors and crawling-level data source classifiers. The source description module can also be multiply instantiated. Its extension points include schema detectors (based on domain logic, type analysis, etc.), content descriptor plug-ins (based on schema analysis, Web page analysis, data source probing, etc.), quality measures (domain dependent and independent), and detectors of specific types of extraction rules (based on structure, context, visual parsing, etc.). For the schema mapping module a single instance is sufficient, but it can be extended with different schema matchers. There can be multiple instances of data mediators and query planners (which can have pluggable source selectors, paired with corresponding content description plug-ins). The wrappers can be instantiated multiply per query and support pluggable data extractors that depend on the particular data extraction rules used for the query.
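As a sketch only, the placeholder-and-plug-in idea could be realised with a small registry of extension points; the SchemaMatcher protocol, register_plugin helper and NameEqualityMatcher below are hypothetical and not taken from any of the surveyed systems.

```python
from typing import Protocol


class SchemaMatcher(Protocol):
    """Extension point for the schema mapping module."""
    def match(self, source_schema: dict, global_schema: dict) -> dict[str, str]:
        """Return a mapping from source attributes to global attributes."""
        ...


# One registry per extension point (schema matchers, extraction-rule
# detectors, source selectors, ...); plug-ins are loaded dynamically.
PLUGIN_REGISTRY: dict[str, dict[str, object]] = {}


def register_plugin(extension_point: str, name: str, plugin: object) -> None:
    PLUGIN_REGISTRY.setdefault(extension_point, {})[name] = plugin


class NameEqualityMatcher:
    """Trivial example matcher: map attributes whose names coincide."""
    def match(self, source_schema: dict, global_schema: dict) -> dict[str, str]:
        return {a: a for a in source_schema if a in global_schema}


register_plugin("schema_matcher", "name_equality", NameEqualityMatcher())
```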
REFERENCES

[1] BENEVENTANO D., BERGAMASCHI S., The MOMIS methodology for integrating heterogeneous data sources. 18th IFIP World Computer Congress, 2004.
[2] BENEVENTANO D., VINCINI M., GELATI G., GUERRA F., BERGAMASCHI S., MIKS: An agent framework supporting information access and integration. AgentLink, 2003, 22-49.
[3] BERGAMASCHI S., CASTANO S., VINCINI M., BENEVENTANO D., Semantic integration of heterogeneous information sources. Data & Knowledge Engineering, 36(3), 2001, 215-249.
[4] BERGMAN M., The deep web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1), 2001.
[5] CHAWATHE S., PAPAKONSTANTINOU Y., ULLMAN J., GARCIA-MOLINA H., IRELAND K., HAMMER J., WIDOM J., The TSIMMIS project: Integration of heterogeneous information sources. 10th Meeting of the Information Processing Society of Japan, 1994, 7-18.
[6] CHEE C., ARENS Y., HSU C., KNOBLOCK C., Retrieving and integrating data from multiple information sources. International Journal of Cooperative Information Systems, 2, 1993, 127-158.
[7] DOAN A., DOMINGOS P., HALEVY A., DHAMANKAR R., LEE Y., iMAP: discovering complex semantic matches between database schemas. 2004 ACM SIGMOD International Conference on Management of Data, 2004, 383-394.
[8] DOMINGOS P., HALEVY A., DOAN A., Learning to match the schemas of databases: A multistrategy approach. Machine Learning, 2003.
[9] FLEJTER D., KACZMAREK T., KOWALKIEWICZ M., ABRAMOWICZ W., Deep Web sources navigation. Submitted to the 8th International Conference on Web Information Systems Engineering, 2007.
[10] GRAVANO L., PAPAKONSTANTINOU Y., Mediating and metasearching on the Internet. IEEE Data Engineering Bulletin, 21(2), 1998, 28-36.
[11] HALEVY A., RAJARAMAN A., ORDILLE J., Querying heterogeneous information sources using source descriptions. 22nd International Conference on Very Large Data Bases, 1996, 251-262.
[12] HE B., PATEL M., ZHANG Z., CHEN-CHUAN CHANG K., Accessing the deep web. Communications of the ACM, 50(5), 2007, 94-101.
[13] KACZMAREK T., Deep Web data integration for company environment analysis (in Polish). PhD thesis, Poznan University of Economics, 2006.
[14] KELLER A., GENESERETH M., DUSCHKA O., Infomaster: an information integration system. 1997 ACM SIGMOD International Conference on Management of Data, 1997, 539-542.
[15] MELNIK S., PETROPOULOS M., BERNSTEIN P., QUIX C., Industrial-strength schema matching. SIGMOD Record, 33(4), 2004, 38-43.
[16] ORDILLE J., RAJARAMAN A., HALEVY A., Data integration: The teenage years. 32nd International Conference on Very Large Data Bases, 2006.
[17] OZSU M., LIU L., LING YAN L., Accessing heterogeneous data through homogenization and integration mediators. 2nd International Conference on Cooperative Information Systems, 1997, 130-139.
[18] PAPAKONSTANTINOU Y., GUPTA A., HAAS L., Capabilities-based query rewriting in mediator systems. 4th International Conference on Parallel and Distributed Information Systems, 1996.
[19] RAMANATH M., Diaspora: A highly distributed Web-query processing system. World Wide Web, 3(2), 2000, 111-124.
[20] RAPOSO J., LARDAO L., MOLANO A., HIDALGO J., MONTOTO P., VINA A., ORJALES V., PAN A., ALVAREZ M., The DENODO data integration platform. 28th International Conference on Very Large Data Bases, 2002, 986-989.
[21] REYNAUD CH., GOASDOUE F., Modeling information sources for information integration. 11th European Workshop on Knowledge Acquisition, Modeling and Management, 1999, 121-138.
[22] TEIXEIRA J., RIBEIRO-NETO B., LAENDER A., DA SILVA A., A brief survey of web data extraction tools. SIGMOD Record, 31(2), 2002, 84-93.
[23] ZIEGLER P., DITTRICH K., Three decades of data integration - all problems solved? 18th IFIP World Computer Congress, 2004, 3-12.
Mr. Suneet Kumar received his M.Tech degree from Rajasthan Vidyapeeth University, Rajasthan, India, in 2006 and is pursuing a Ph.D degree from Bhagwant