
Filling the gap between data federation and data integration

2004


Antonella Poggi and Marco Ruzzi
Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”
Via Salaria 113, I-00198 Roma, Italy
{poggi,ruzzi}@dis.uniroma1.it

Abstract. Today’s fast and continuous growth of large business organizations enforces the need to integrate and share data coming from a number of heterogeneous and distributed data sources. To address this problem, data federation tools adopt a database management system as a kind of middleware infrastructure that uses a set of wrappers to access heterogeneous data sources. In this paper, we highlight the limitations of federation tools with respect to the information integration problem and show how data integration systems overcome them. We then propose two different techniques to implement a data integration system using a data federation tool. Finally, we present experimental results that compare the two solutions.

1 Introduction

Today’s fast and continuous growth of large business organizations, often deriving from mergers of smaller enterprises, creates an increasing need to integrate and share large amounts of data coming from a number of heterogeneous and distributed data sources. Similar needs arise in other application fields, such as information systems for administrative organizations, life sciences research, and many others. Moreover, it is not infrequent that different parts of the same organization, whatever it is, adopt different systems to produce and maintain their critical data. All these situations are concerned with the problem of information integration.

Some recent software solutions to the problem adopt a database management system (DBMS) as a kind of middleware infrastructure that uses a set of software modules, called wrappers, to access heterogeneous data sources [8]. Such wrappers hide the native characteristics of each source, presenting it under the appearance of a common relational table. Their aim is to mediate between the federated database and the sources, mapping the data model of each source to the federated database data model and transforming operations over the federated database into requests that the source can handle. This technique takes the name of data federation and allows users to exploit the full power of the SQL language supported by the adopted DBMS to combine the data coming from the sources. The user can thus federate any kind of source for which an appropriate wrapper exists, involving different sources within a single SQL statement. Among the variety of available products, IBM DB2 Information Integrator (DB2II) [2] is the only commercial data federation tool we are aware of [7], and it is therefore the one we consider in the present paper.

As we explain in the following, the features provided by data federation tools are often inadequate for an acceptable solution to the problem of information integration. We will show how to use a data integration system to overcome these limitations, providing a flexible architecture that allows the user to uniformly access various sources and combine their data into a global unified view.
To this aim, the user defines a global schema, i.e., a set of global concepts that represent the domain of interest, and explains, by means of a mapping, the relationships that hold between these global entities and the underlying sources, which are represented within the data integration system by a source schema [11, 9]. To better represent the domain of interest, the user can enrich the global schema with integrity constraints that the system takes into account while answering requests [5, 3, 4]. The user interacts with the system by posing queries over the global schema, while the system carries out the task of suitably accessing the sources and retrieving the data that form the answer.

In this paper we present an approach to the problem of information integration that applies data integration theory in a data federation environment. More specifically, assuming a relational framework, we define a formal specification of a data integration system, comprising a global schema, a mapping and a source schema, and we show how to implement this specification by means of a commercial tool for data federation. This is obtained by: (i) producing an instance of a federated database through the compilation of such a specification; (ii) translating the user queries posed over the global schema, so as to issue them to the federated database. To this aim, we propose two different compilation techniques; in both cases, we show the correctness of the compilation with respect to the semantics of the system. Finally, we compare the two solutions through a set of experiments conducted using DB2II.

The paper is structured as follows: in Section 2 we present the basic concepts regarding data federation, referring to the architecture of IBM DB2 Information Integrator, and we analyze its limitations with respect to the problem of information integration; in Section 3 we introduce the classical logical framework for data integration; in Section 4 we present two different techniques for specification compilation and query rewriting, which we test and compare in Section 5; finally, we conclude the paper and outline future work.

2 DB2II: IBM’s data federation tool

Data federation tools enable data from multiple heterogeneous data sources to appear as if they were contained in a single federated database. Besides uniform access, performance and resource requirements are also considered, in order to suitably face typical needs such as joins or aggregations across different data sources. In this section, we provide an overview of a commercial solution. We choose DB2 Information Integrator (DB2II), since it is known to provide a very efficient federation of heterogeneous data sources [2]. Then, starting from a critique of DB2II, we discuss the limits of data federation with respect to data integration.

2.1 Overview of DB2II

DB2II is IBM’s solution to the data federation problem [8]. It provides single, uniform access to a large variety of heterogeneous data sources. More precisely, DB2II provides different wrappers (implemented by different libraries) for many popular data sources, such as relational sources (Oracle, MS SQL Server, ...), semistructured XML sources, and many common specialized data sources (such as an Excel table). Note that several homogeneous data sources can refer to the same wrapper. Inside a particular data source, data sets of interest are modeled as nicknames, which are basically virtual views over those data sets. For example, inside a relational source, DB2II associates a nickname with each relation of interest, while, for a non-relational XML source, DB2II associates a nickname with a fragment of the XML document characterized by an XPath expression.
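
For concreteness, the following minimal sketch shows the general shape of the DDL with which a designer registers a wrapper, a server and a nickname; all identifiers (NET8, ORA_SRV, FED.CITIZEN, ...) are hypothetical, and the exact clauses depend on the wrapper and on the product version:

  # Illustrative DB2II federation DDL (a sketch; identifiers are hypothetical
  # and the exact options vary with the wrapper and product version).
  FEDERATION_DDL = [
      # register the wrapper library, e.g. the one for Oracle sources
      "CREATE WRAPPER NET8",
      # describe a remote server reachable through that wrapper
      "CREATE SERVER ORA_SRV TYPE ORACLE VERSION '9' WRAPPER NET8 "
      "OPTIONS (NODE 'ora_node')",
      # map the federated user onto credentials at the remote source
      "CREATE USER MAPPING FOR USER SERVER ORA_SRV "
      "OPTIONS (REMOTE_AUTHID 'scott', REMOTE_PASSWORD 'tiger')",
      # expose a remote table as a relational nickname in the federated schema
      'CREATE NICKNAME FED.CITIZEN FOR ORA_SRV."HR"."CITIZEN"',
  ]

  if __name__ == "__main__":
      for stmt in FEDERATION_DDL:   # in practice, issued through a DB2 client
          print(stmt + ";")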
The DB2II wrapper architecture enables the federation of heterogeneous data sources, maintaining state information for each of them, managing connections, decomposing queries into fragments that each source can handle, and managing transactions across data sources. It supplies the infrastructure to model multiple data sets and multiple operations, letting specialized functions supported by the external source be invoked in a query even though DB2 does not natively support them. Furthermore, it includes a flexible framework to provide input to the query optimizer about the cost of data source operations and the size of the data that will be returned, supplying at the same time information based on the immediate context of the query. Finally, wrappers participate as resource managers [8] in DB2 transactions.

2.2 Limits of data federation

Let us consider the DB2II approach to information integration presented above. As a federation tool, DB2II does not let the designer define an arbitrary description of the domain of interest through which the users can access the external sources. In particular, we have identified two main limiting features of the product:

1. As already mentioned, DB2II models an external source data set as a nickname that consists of a virtual view on the data set. Therefore, DB2II establishes a one-to-one correspondence between source data sets and nicknames, instead of letting the designer define a correspondence between a concept of interest, represented as a single relation, and a view over multiple source data sets.
2. Furthermore, if the source is relational, the nickname schemas are identical to those of the modeled data source relations, which means that both have the same number, names and types of attributes. Even if the definition of nickname schemas is a little more flexible for non-relational data sources, there are still several rules that the designer has to follow, which limit the expressiveness of the correspondence even inside a single source data set.

The mentioned limits are typical of data federation. Indeed, in this kind of tool, the designer is provided with a view of the source data sets that is source-dependent, since it basically reflects the structure of the source data sets. In the next section, we present a formal approach to the problem of data integration, showing how this theory can fill the gap left by such limitations.

3 A logical framework for data integration

Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data [11, 12, 10]. The adoption of a logical formalism in specifying the global schema enables users to focus on the intensional aspects of the integration (what they want), rather than on the procedural aspects (how to obtain the answers). On the one hand, through the specification of a global schema, the domain of interest can be described independently of the structure of the data sources. On the other hand, the mapping models the relation between the global schema and the sources. The limits of data federation are therefore overcome by data integration systems.
Such an approach guarantees a higher level of flexibility and eventually allows the system to be extended, for example by introducing integrity constraints over the global schema. In this section, we set up a logical framework for data integration. We first formalize the notion of a data integration system, then we specify its semantics.

3.1 Syntax

We consider an infinite, fixed alphabet Γ of constants representing real-world objects, and we take into account only databases having Γ as domain. Furthermore, we adopt the so-called unique name assumption, i.e., we assume that different constants denote different objects. Formally, a data integration system I is a triple ⟨G, S, M⟩, where:

1. G is the global schema, expressed in the relational model. In particular, G is a set of relations, each with an associated arity that indicates the number of its attributes: G = {R_1(A_1), R_2(A_2), ..., R_n(A_n)}, where A_i is the sequence of attributes of R_i, for i = 1..n.
2. S is the source schema, constituted by the schemas of the various sources. We assume that the sources are relational. Dealing with only relational sources is not restrictive, since we can always assume that suitable wrappers (such as the DB2II wrappers) present the sources in relational format. In particular, S is the union of the sets of data source relations S_j: S = ∪_{j=1}^{m} S_j, with S_j = {r_{j1}(a_{j1}), r_{j2}(a_{j2}), ..., r_{jn_j}(a_{jn_j})}, where a_{jh} is the sequence of attributes of the h-th relation of the j-th source, for j = 1..m, h = 1..n_j.
3. M is the mapping between the global and the source schema. In our framework, the mapping is defined following the Global-As-View (GAV) approach, i.e., each global relation in G is associated with a view, i.e., a query over the sources. We assume that views in the mapping are specified through conjunctive queries (CQ). We recall that a CQ q of arity n is a rule of the form q(x_1, ..., x_n) ← conj(x_1, ..., x_n, y_1, ..., y_m), where conj(x_1, ..., x_n, y_1, ..., y_m) is a conjunction of atoms involving the variables x_1, ..., x_n, y_1, ..., y_m and a set of constants of Γ. We denote the atom q(x_1, ..., x_n) by head(q), while body(q) denotes the set conj(x_1, ..., x_n, y_1, ..., y_m). A GAV mapping is therefore a set of CQs q, where head(q) is built on a relation symbol of G and each atom in body(q) is built on a relation symbol of S. Each such CQ has the form R(x) ← r_1(x_1, y_1), ..., r_k(x_k, y_k), where r_h ∈ S for h = 1..k, and x = ∪_{h=1}^{k} x_h. Several CQs in M may share the same head relation R; we denote by ρ(R) the union of the CQs of M having R in their head, if any exists. As Example 1 below shows, ρ(R) is thus, in general, a union of conjunctive queries.

Finally, a query over the global schema is a formula that is intended to extract a set of tuples of elements of Γ. We assume that the language used to specify queries is that of unions of conjunctive queries (UCQ). We recall that a UCQ of arity n is a set of conjunctive queries Q such that each q ∈ Q has the same arity n and uses the same predicate symbol in the head. A query over I is a UCQ that only uses relation symbols of G in the bodies of its rules.
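
Before the example, the following minimal Python sketch fixes one possible concrete encoding of atoms, views and GAV mappings; it is purely illustrative, assuming the datalog-style convention that capitalised strings are variables and anything else (lowercase or quoted strings, e.g. "'Rome'") is a constant of Γ. The later sketches in Section 4 assume this encoding:

  from typing import Dict, List, Tuple

  # One concrete encoding of the framework above (an assumption, not part of
  # the formal specification): a term is a string; capitalised strings are
  # variables, anything else is a constant of Gamma.
  Term = str
  Atom = Tuple[str, Tuple[Term, ...]]          # (relation symbol, arguments)
  View = Tuple[Tuple[Term, ...], List[Atom]]   # (head arguments, body atoms)
  Mapping = Dict[str, List[View]]              # several views per head = UCQ

  # e.g. a GAV view R(x, y) <- r1(x, z), r2(z, y) over two source relations:
  rho_R: View = (("X", "Y"), [("r1", ("X", "Z")), ("r2", ("Z", "Y"))])
  M: Mapping = {"R": [rho_R]}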
An example of a data integration system specification follows.

Example 1. We build up a data integration system specification I considering three data sources: the first stores information coming from the Registry Office concerning citizens and enterprises, the second holds geographical information about cities and their location, and the third stores information coming from the Land Register about ownerships (buildings and lands). The source schema S that represents these sources within the system contains four relations:

  s1 = citizen(ssn, name, citycode)
  s2 = enterprise(ssn, name, citycode, emp_number)
  s3 = city(code, name, country)
  s4 = ownership(code, owner_ssn, address, type)

We want to extract from these sources only information about owners and their buildings. We can define the global schema G with two relations

  R1 = owner(name, city)
  R2 = building(address, owner)

and the mapping M can be defined as follows:

  v1 = owner(X, Y) ← citizen(W1, X, Z), city(Z, Y, W2)
  v2 = owner(X, Y) ← enterprise(W1, X, Z, W2), city(Z, Y, W3)
  v3 = building(X, Y) ← ownership(W1, Y, X, 'building')

A possible query q, asking for the owners and the addresses of the buildings of Rome, is

  q = Q(X, Y) ← owner(X, 'Rome'), building(Y, X)

3.2 Semantics

A database instance (or simply database) C for a relational schema DB is a set of facts of the form r(t), where r is a relation of arity n in DB and t is an n-tuple of constants of Γ. We denote by r^C the set {t | r(t) ∈ C} and by q^C the result of evaluating the query q (expressed over DB) on C.

[Figure 1. Levels of integration: the global schema (relations, virtual data) is connected by GAV mappings to the federated schema (nicknames, virtual data), which wrappers connect to the source data sets (real data).]

In order to assign semantics to a data integration system I = ⟨G, S, M⟩, we start by considering a source database for I, i.e., a database D for the source schema S. We call global database for I any database for G. Given a source database D for ⟨G, S, M⟩, the semantics of I w.r.t. D, sem(I, D), is the set of global databases B for I such that B satisfies M with respect to D. In particular, B is such that, for each view ρ(R) in the GAV mapping M, all the tuples that satisfy ρ(R) over D belong to R^B, i.e., ρ(R)^D ⊆ R^B.

Finally, we specify the semantics of queries posed to a data integration system. Such queries are expressed in terms of the symbols in the global schema of I. Formally, given a source database D for I, we call certain answers to a query q with respect to I and D the set q^{I,D} of tuples t of objects in Γ such that t ∈ q^B for every global database B for I with respect to D, i.e., q^{I,D} = {t | ∀B ∈ sem(I, D), t ∈ q^B}.
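
Since the mapping is sound (ρ(R)^D ⊆ R^B) and both views and queries are monotone (unions of) conjunctive queries, the certain answers coincide with the evaluation of the query over the retrieved global database, i.e., the global database obtained by materializing each view over D. The following naive sketch illustrates this recipe on a hypothetical toy source database loosely modeled on Example 1:

  # Naive certain-answer computation for a GAV system with CQ views and a CQ
  # query. All data below is a hypothetical toy instance.

  def is_var(t):
      # assumed convention: capitalised strings are variables
      return t[:1].isupper()

  def eval_body(body, facts, subst):
      """Enumerate the substitutions making every atom of `body` true in `facts`."""
      if not body:
          yield subst
          return
      (pred, args), rest = body[0], body[1:]
      for tup in facts.get(pred, set()):
          s = dict(subst)
          for a, v in zip(args, tup):
              if is_var(a):
                  if s.setdefault(a, v) != v:
                      break              # clashes with an earlier binding
              elif a != v:
                  break                  # constant does not match the fact
          else:
              yield from eval_body(rest, facts, s)

  def answers(head_args, body, facts):
      return {tuple(s.get(a, a) for a in head_args)
              for s in eval_body(body, facts, {})}

  D = {  # toy source database (hypothetical data)
      "citizen": {("s1", "rossi", "c01")},
      "city":    {("c01", "rome", "italy")},
  }

  # retrieved global database: materialise owner(name, city) via both views
  owner_views = [
      [("citizen", ("W1", "X", "Z")), ("city", ("Z", "Y", "W2"))],
      [("enterprise", ("W1", "X", "Z", "W2")), ("city", ("Z", "Y", "W3"))],
  ]
  B = {"owner": set().union(*(answers(("X", "Y"), v, D) for v in owner_views))}

  # certain answers of Q(X) <- owner(X, rome) w.r.t. I and D: {('rossi',)}
  print(answers(("X",), [("owner", ("X", "rome"))], B))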
4 Efficient data integration through data federation

As illustrated in the previous section, data integration supplies a higher level of abstraction with respect to data federation. In particular, as shown in Figure 1, a data integration system can be considered as composed of: (i) a global schema, which constitutes the intensional description of the data of interest; (ii) a source schema, which consists basically of a federated schema; (iii) a set of source data sets, which represent the real data; (iv) a set of GAV mappings that model the relation between the global relations and the federated relations; and (v) a set of wrappers that implement the one-to-one correspondence between the federated relations and the source data sets. Such a scenario can be seen as a specialization of the well-known wrapper-mediator architecture [13]. In this section, we first present an overview of our approach to realizing an efficient data integration system relying on a federation tool; then, we define two different techniques based on this approach; finally, we discuss the correctness of both techniques.

4.1 Overview of our solution

The solution we propose aims to implement an efficient data integration system that offers all the optimization techniques supplied by the federation tool, besides the expressiveness of the global schema. Note that, in the specification of the logical framework for data integration, we assumed a relational source schema S; through this assumption, we implicitly assume to take advantage of the wrapper mechanism supplied by the federation tool. Let us use DB2II to implement a data integration system. We first need to build the source schema S by creating a set of nicknames. Then, we need to implement two modules:

1. a compiler, responsible for compiling the data integration system I into a DB2II instance I′;
2. a translator, responsible for translating a query Q over the data integration system I into a query Q′ over the corresponding DB2II instance I′.

An overview of the architecture of the entire system is given in Figure 2.

[Figure 2. System architecture: the compiler turns the data integration system specification into a DB2II instance; the translator turns a query q over the global schema into a query q′ over that instance.]

Suppose we start from a DB2II instance that contains all the nicknames in S. There are two possible techniques for the implementation of both modules:

– Internal technique: the idea is to rely upon DB2II’s management of views. The compiler builds a DB2II instance that contains a view V_R for each global relation R, where V_R is defined by the query ρ(R) expressed in the data integration system mapping.
– External technique: the idea is to focus the process on the user queries, rather than on the system specification, implementing an unfolding technique [13] that takes into account the correspondences expressed in the mapping.

Note that, in the following, for lack of space, we omit the rules that specify the translation between the logical representation of mappings and queries and the corresponding SQL statements issued to the DBMS.

4.2 Internal technique: implementing data integration using DB2II views

Since DB2II relies on DB2 UDB (Universal Database), IBM’s DBMS, a possible way to realize data integration is to take advantage of DB2II’s management of views in order to implement the correspondences expressed in the mapping. In particular, the DB2II instance I_i′ that results from the compilation of an input data integration system I = ⟨G, S, M⟩ is characterized by a set of views V = {W}, where each view is represented as a pair W = ⟨V(A), Ψ(V)⟩ such that:

– V(A) is the view schema, where A is a vector of attributes and V is the view name;
– Ψ(V) is the query over S that defines the view; each CQ of Ψ(V) has the form V(x) ← S_1(x_1, y_1), ..., S_k(x_k, y_k), where x = ∪_{h=1}^{k} x_h and the components of the vectors x and y_h are variables or constants of the set Γ, for h = 1..k.

The compilation algorithm compile_i(I), shown in Figure 3, generates the set V of views of I_i′ by applying the following rule: if R(A) ∈ G, then insert into V the pair W = ⟨V_R(A), Ψ(V_R)⟩, where Ψ(V_R) contains the CQ V_R(x) ← r_1(x_1, y_1), ..., r_k(x_k, y_k) for each CQ R(x) ← r_1(x_1, y_1), ..., r_k(x_k, y_k) of ρ(R). A hand-derived SQL rendering for Example 1 follows the figure.

Figure 3. The compile_i(I) algorithm:

  V := ∅;
  for each R(A) ∈ G do
    V := V ∪ {⟨V_R(A), V_R ← body(ρ(R))⟩}
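
Although we omit the systematic translation rules to SQL, the effect of compile_i on Example 1 can be rendered by hand. In the following sketch, the identifiers (V_OWNER, CITIZEN, ...) are hypothetical nickname and view names assumed to exist in the instance; the UCQ view ρ(owner) becomes a UNION view, and the translated query anticipates the translate_i algorithm described next:

  # Hand-derived SQL corresponding to compile_i over Example 1 (a sketch;
  # identifiers are hypothetical). V_OWNER renders the UCQ view rho(owner).
  COMPILED_VIEWS = [
      """CREATE VIEW V_OWNER (NAME, CITY) AS
           SELECT C.NAME, T.NAME
             FROM CITIZEN C, CITY T
            WHERE C.CITYCODE = T.CODE
           UNION
           SELECT E.NAME, T.NAME
             FROM ENTERPRISE E, CITY T
            WHERE E.CITYCODE = T.CODE""",
      """CREATE VIEW V_BUILDING (ADDRESS, OWNER) AS
           SELECT O.ADDRESS, O.OWNER_SSN
             FROM OWNERSHIP O
            WHERE O.TYPE = 'building'""",
  ]

  # translate_i (next subsection) only renames global atoms to views, so the
  # query q of Example 1 becomes:
  TRANSLATED_Q = """SELECT O.NAME, B.ADDRESS
                      FROM V_OWNER O, V_BUILDING B
                     WHERE O.CITY = 'Rome' AND B.OWNER = O.NAME"""

  for stmt in COMPILED_VIEWS:
      print(stmt + ";")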
Now, after having created the set of views V in the DB2II instance I_i′, we need to provide a translation mechanism that, given a conjunctive query Q over I, returns a corresponding query Q_i′ over the set of views V belonging to I_i′. The translation algorithm translate_i(Q, I), shown in Figure 4, generates the result by applying two syntactical rules:

1. the head of the query Q_i′ is identical to the head of the query Q;
2. for each atom R(x_i, y_i) in the body of Q, insert the atom V_R(x_i, y_i) into the body of Q_i′.

Figure 4. The translate_i(Q, I) algorithm:

  qaux := {};
  for each g ∈ body(Q) such that g = R(x_i, y_i) do
    qaux := qaux ∪ {V_R(x_i, y_i)}
  Q′ := head(Q) ← qaux

Example 2. Referring to the system specification I of Example 1, the compiled DB2II instance I_i′ is:

  ⟨V_owner(name, city), {V_owner(X, Y) ← citizen(W1, X, Z), city(Z, Y, W2),
                         V_owner(X, Y) ← enterprise(W1, X, Z, W2), city(Z, Y, W3)}⟩
  ⟨V_building(address, owner), V_building(X, Y) ← ownership(W1, Y, X, 'building')⟩

Note that the view definition for V_owner is a union of conjunctive queries. The translation of the user query q, to be posed over the defined views, is

  Q_i = Q(X, Y) ← V_owner(X, 'Rome'), V_building(Y, X)

4.3 External technique: external management of global schema and mapping

In the second solution, we specify a technique for implementing a data integration system specification by means of external data structures. In short, we leave inside the DB2II instance I_e′ only the federated schema S, that is, the set of nicknames that wrap the sources, and we maintain the global schema and the mapping between global relations and nicknames in appropriate outer structures. Therefore, in this solution, the compiler behaves like the identity function, i.e., the DB2II instance I_e′ consists only of S. As said in the previous sections, the system user poses his queries over the global schema. In order to process the user’s requests, we have to provide a translation mechanism that allows the queries to be issued over the compiled DB2II instance I_e′. Informally, this may be done by substituting each atom appearing in the body of the query with the body of its corresponding mapping definition. This process can be seen as an extension of the well-known unfolding algorithm [13], and can be formalized as follows: given a conjunctive query q over the relational symbols of G, where G is the global schema of I = ⟨G, S, M⟩, the translated query for I_e′ is the query q′ = translate_e(q, I), expressed over the relational symbols of S.

Figure 5. The translate_e(q, I) algorithm:

  qaux := q;
  for each g in body(q)
    v := mapping_by_head(g);
    σ := unify(g, head(v));
    if (σ == {}) return NULL;
    else qaux := σ[replace(qaux, g, body(v))];
  end for
  return qaux;

Figure 5 shows the unfolding algorithm we adopt for query reformulation. The algorithm makes use of some subroutines: mapping_by_head(g), which, given an atom g, returns a mapping view whose head unifies with g; unify(g1, g2), which unifies the variables of g1 and g2, returning, if a unifier can be found, a substitution σ that makes the two atoms equal; and replace(q, g, conj), which, given a conjunctive query q, one of its body atoms g, and a conjunction conj, replaces the atom g in the body of q with the conjunction conj. When a global relation is mapped by more than one view, as owner in Example 1, the algorithm is applied once for each combination of views, and the translated query is the union of the resulting conjunctive queries; a concrete sketch of this branching unfolding follows.
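
As a concrete counterpart of Figure 5, the following minimal Python sketch implements the branching unfolding under the encoding of Section 3.1; it assumes that view heads list distinct variables, that every global predicate in the query is mapped, and that quoted strings such as "'Rome'" denote constants:

  from itertools import product

  def is_var(t):
      # assumed convention: capitalised strings are variables
      return t[:1].isupper()

  def unfold(head, body, mapping):
      """Rewrite a CQ over G into a UCQ over S (a sketch of translate_e)."""
      rewritings = []
      # one choice of view per global atom; the cross product yields the union
      for pick in product(*[mapping[pred] for pred, _ in body]):
          new_body = []
          for i, ((_, args), (head_args, view_body)) in enumerate(zip(body, pick)):
              theta = dict(zip(head_args, args))  # unify view head / query atom
              for vpred, vargs in view_body:
                  new_body.append((vpred, tuple(
                      # head variables take the query atom's terms; existential
                      # variables are renamed apart using the atom's index
                      theta.get(t, t + "_" + str(i) if is_var(t) else t)
                      for t in vargs)))
          rewritings.append((head, new_body))
      return rewritings

  # the mapping M and query q of Example 1
  M = {
      "owner": [
          (("X", "Y"), [("citizen", ("W1", "X", "Z")), ("city", ("Z", "Y", "W2"))]),
          (("X", "Y"), [("enterprise", ("W1", "X", "Z", "W2")), ("city", ("Z", "Y", "W3"))]),
      ],
      "building": [(("X", "Y"), [("ownership", ("W1", "Y", "X", "'building'"))])],
  }
  q_head, q_body = ("Q", ("X", "Y")), [("owner", ("X", "'Rome'")), ("building", ("Y", "X"))]
  for cq in unfold(q_head, q_body, M):
      print(cq)  # prints the two source-level CQs of Example 3

Unlike Figure 5, the sketch does not perform full unification, so repeated variables or constants in view heads would require extra care in a complete implementation.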
Example 3. We consider again the system specification I of Example 1. As said before, there is no need to compile the specification, but we have to unfold the query q over the source relations. Based on the unfolding algorithm, we obtain the following union of conjunctive queries:

  Q′(X, Y) ← citizen(W1, X, Z), city(Z, 'Rome', W2), ownership(W3, X, Y, 'building')
  Q′(X, Y) ← enterprise(W1, X, Z, W2), city(Z, 'Rome', W3), ownership(W4, X, Y, 'building')

4.4 Correctness

So far we have proposed two different techniques that allow the implementation of a data integration system in DB2II. In both techniques, the compilation and translation algorithms basically consist of syntactical transformations. The following theorems show the correctness of both solutions (we omit the proofs for space reasons).

Theorem 1. Given a data integration system specification I = ⟨G, S, M⟩, a source database D for I, a conjunctive query Q over G and a tuple t̄, t̄ ∈ Q^{I,D} if and only if t̄ belongs to the set of answers obtained by querying the compiled DB2II instance I_i′ by means of the query Q_i′, where I_i′ is the DB2II instance obtained from the compilation of I by means of the algorithm compile_i(I), and Q_i′ is the query obtained from the translation of Q by means of the algorithm translate_i(Q, I).

Theorem 2. Given a data integration system specification I = ⟨G, S, M⟩, a source database D for I, a conjunctive query Q over G and a tuple t̄, t̄ ∈ Q^{I,D} if and only if t̄ belongs to the set of answers obtained by querying the DB2II instance I_e′ by means of the query Q_e′, where I_e′ consists only of S, and Q_e′ is the query obtained from the translation of Q by means of the algorithm translate_e(Q, I).

5 Experimental results

In order to test the feasibility of the two techniques presented in the previous sections, we have carried out some experiments on real data coming from the University of Rome “La Sapienza”. The experimental scenario comprises three significantly overlapping data sources: (i) a DB2 database instance holding administrative information, i.e., registry, exam and career information about students (204014 tuples); (ii) a Microsoft SQL Server database instance storing information about the exam plans of students (28790 tuples); (iii) some XML documents containing information about exams of students of the computer science diploma course (5836 tuples). Note that the number of tuples refers to the federated tables resulting from wrapping.

Starting from these sources, we have built a GAV data integration system whose specification, which we omit for lack of space, contains 10 global relations and 27 source relations. In order to test the efficiency of our solutions regardless of network traffic and delays, we have carried out the experiments on local instances of the data, so as to obtain a truthful comparison of the two techniques. We have conducted the experiments on a dual Intel Pentium IV Xeon machine, with a 3 GHz processor clock frequency, equipped with 2 GB of RAM and a 30 GB SCSI hard disk at 7200 RPM; the machine runs the Windows XP operating system. We have run 5 test queries on both the specifications: one obtained by the internal technique presented in Section 4.2 and the other generated with the external technique proposed in Section 4.3. We observe that the internal technique, based on DB2II view definitions, is faster than the external one, which implements a query rewriting algorithm. Furthermore, the gain obtained with the first technique is greater for queries with a higher number of joins. This is due to the capability of the DB2II query engine to efficiently process views, taking into account source statistics and access optimizations. Comparative execution times are shown in Figure 6, where the percentage difference between the slower and the faster technique is also presented.

[Figure 6. Data integration through data federation: (a) query execution times; (b) percentage comparisons.]
6 Conclusions and future work

The goal of this paper was to propose a novel approach to realizing a data integration system by means of a commercial data federation tool (DB2II). Summarizing, we have provided the following contributions: (i) we have highlighted the limits of data federation tools with respect to information integration; (ii) we have proposed two different compilation techniques to overcome such limitations; (iii) we have presented experimental results that compare the two approaches. The proposed solutions take advantage of the tool’s wrapper mechanism, besides offering the expressiveness typical of a data integration system. Two fully running versions of our system have been implemented, each following one of the techniques. Through a set of experiments we compared the two techniques, in order to evaluate which solution makes the most of the tool’s optimization capabilities. In particular, we obtained better results using the internal management of views of DB2II, instead of representing the mappings by appropriate external data structures. Nevertheless, we notice that the external technique may be preferable if the designer wants to take advantage of the full power of the logical approach, keeping the opportunity, for example, to exploit some recent techniques for the treatment of incomplete and inconsistent information [11]. In fact, this work represents the first step in the implementation of a data integration system through a data federation tool. Among the perspectives of our work, we plan to extend our approach by enhancing the modeling power of the system, for example by allowing the specification of typing constraints or integrity constraints over the global schema. Moreover, we plan to investigate how to rely upon a commercial federation tool while following the Local-As-View (LAV) approach to data integration [1, 6], and, finally, to offer different languages for the specification of the global schema, such as object-oriented or semistructured languages.

References

1. Serge Abiteboul and Oliver Duschka. Complexity of answering queries using materialized views. In Proc. of the 17th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’98), pages 254–265, 1998.
2. Paolo Bruni, Francis Arnaudies, Amanda Bennett, Susanne Englert, and Gerhard Keplinger. Data federation with IBM DB2 Information Integrator V8.1, 2003. IBM Redbook, http://www.redbooks.ibm.com/redbooks/pdfs/sg247052.pdf.
3. Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Data integration under integrity constraints. Information Systems, 2003. To appear.
4. Andrea Calì, Domenico Lembo, and Riccardo Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In Proc. of the 22nd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2003), pages 260–271, 2003.
5. Oliver M. Duschka, Michael R. Genesereth, and Alon Y. Levy. Recursive query plans for data integration. J. of Logic Programming, 43(1):49–73, 2000.
6. Gösta Grahne and Alberto O. Mendelzon. Tableau techniques for querying information sources through global schemas. In Proc. of the 7th Int. Conf.
on Database Theory (ICDT’99), volume 1540 of Lecture Notes in Computer Science, pages 332–347. Springer, 1999.
7. L. M. Haas. A researcher’s dream. DB2 Magazine, 8(3):34–40, 2003.
8. L. M. Haas, E. T. Lin, and M. A. Roth. Data integration through database federation. IBM Systems Journal, 41(4):578–596, 2002.
9. Alon Y. Halevy. Answering queries using views: A survey. Very Large Database J., 10(4):270–294, 2001.
10. Richard Hull. Managing semantic heterogeneity in databases: A theoretical perspective. In Proc. of the 16th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’97), pages 51–61, 1997.
11. Maurizio Lenzerini. Data integration: A theoretical perspective. In Proc. of the 21st ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2002), pages 233–246, 2002.
12. Alon Y. Levy. Logic-based techniques in data integration. Pages 575–595. Kluwer Academic Publishers, 2000.
13. Jeffrey D. Ullman. Information integration using logical views. In Proc. of the 6th Int. Conf. on Database Theory (ICDT’97), volume 1186 of Lecture Notes in Computer Science, pages 19–40. Springer, 1997.