
Schema management for large-scale multidatabase systems


Item Type: text; Dissertation-Reproduction (electronic)
Author: Wei, Chih-Ping, 1965-
Publisher: The University of Arizona
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Link to Item: http://hdl.handle.net/10150/290610

SCHEMA MANAGEMENT FOR LARGE-SCALE MULTIDATABASE SYSTEMS

by Chih-Ping Wei

Copyright © Chih-Ping Wei 1996

A Dissertation Submitted to the Faculty of the Committee on Business Administration in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy with a Major in Management, in the Graduate College, The University of Arizona, 1996.

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Final Examination Committee, we certify that we have read the dissertation prepared by Chih-Ping Wei, entitled "Schema Management for Large-Scale Multidatabase Systems," and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy. Anindya Datta, Ralph Martinez. Final approval and acceptance of this dissertation is contingent upon the candidate's submission of the final copy of the dissertation to the Graduate College. I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.
STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under the rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.

ACKNOWLEDGEMENTS

I would like to thank all my committee members, Dr. Olivia R. Liu Sheng, Dr. Hsinchun Chen, Dr. Anindya Datta, Dr. Ralph Martinez, and Dr. Bernard P. Zeigler, who have been helpful in many ways. My deepest thanks go to my dissertation advisor, Dr. Olivia R. Liu Sheng, who is an incredibly talented, thoughtful and inspiring individual. I am extremely fortunate to have Dr. Sheng as my mentor; she has always been there for me and has helped me over many hurdles during my research development. This dissertation would not have achieved such high quality without her belief in my capabilities and her guiding me to work to my potential. I am deeply indebted to my close friends, Paul Jen-Hwa Hu, Pony Huei-Hwa Ma, Hsiao-Tang Chang, Yiming Chung and Ming-Hsuan Yang, who went above and beyond the call of friendship to help me finish up the dissertation and put life in the right perspective when the going got tough. My gratitude to them is beyond words. Moreover, it is impossible to measure my indebtedness to my girlfriend, Mei-Yun Wang. She has been remarkably understanding and supportive in everything I pursue. I honestly believe that I would not have achieved what I have without the unceasing love, encouragement and trust from my parents and my sisters. I dedicate this work to those who love me and make my life blessed with hope, joy and wonders.

DEDICATION

To my grandfather, who passed away in 1995 with an unrealized life-time wish whose realization comes nine months too late.

TABLE OF CONTENTS

List of Figures
List of Tables
List of Algorithms
Abstract
1. Introduction
  1.1 Background
  1.2 Overview of Multidatabase Systems
    1.2.1 Characteristics of Multidatabase Systems
    1.2.2 Objectives of Multidatabase Systems
    1.2.3 Schema Architecture of Multidatabase Systems
    1.2.4 Research Issues in Multidatabase Systems
  1.3 Issues of Schema Management for Multidatabase Systems
  1.4 Research Motivation and Objective
  1.5 Organization of the Dissertation
2. Literature Review and Research Formulation
  2.1 Literature Review on Model and Schema Translation
    2.1.1 Formulation of Model and Schema Translation Problems
    2.1.2 Approaches to Model Translation
  2.2 Literature Review on Schema Integration
    2.2.1 Formulation of Schema Integration Problem
    2.2.2 Taxonomy of Conflicts
    2.2.3 Analysis of Conflict Types and Schema Integration Process
  2.3 Research Foundation
    2.3.1 Semantics Characteristics of Data Models
    2.3.2 Implication to Schema Normalization
    2.3.3 Implication to Model and Schema Translation
    2.3.4 Summary of Research Foundation
  2.4 Research Framework and Tasks
  2.5 Research Questions
3. Metamodel Development
  3.1 Modeling Paradigm and Beyond
    3.1.1 Modeling Paradigm
    3.1.2 Components of a Model
    3.1.3 Metamodeling Paradigm
    3.1.4 Requirements of Metamodel
    3.1.5 Extended Metamodeling Paradigm
  3.2 Metamodel: Synthesized Object-Oriented Entity-Relationship Model
    3.2.1 Constructs of the SOOER Model
    3.2.2 Constraint Specification in the SOOER Model
  3.3 Metamodel Schema and Model Definition Language
    3.3.1 Metamodel Constructs in Metamodel Schema
    3.3.2 Explicit Metamodel Constraints in Metamodel Schema
    3.3.3 Metamodel Semantics in Metamodel Schema
    3.3.4 Implicit Metamodel Constraints in Metamodel Schema
    3.3.5 Model Definition Language
  3.4 Metamodeling Process and Model Schema
4. Inductive Metamodeling
  4.1 Overview
    4.1.1 Characteristics of Inductive Metamodeling Problem
    4.1.2 Comparisons with Existing Inductive Learning Techniques
  4.2 Abstraction Induction Technique for Inductive Metamodeling
    4.2.1 Concept Decomposition Phase
      4.2.1.1 Concept Hierarchy Creation
      4.2.1.2 Concept Hierarchy Enhancement
      4.2.1.3 Concept Graph Merging
      4.2.1.4 Concept Graph Pruning
    4.2.2 Concept Generalization Phase
      4.2.2.1 Generalization of Property Nodes
      4.2.2.2 Generalization of Leaf Concept Nodes
      4.2.2.3 Generalization of Non-Leaf Concept Nodes and Immediate Downward Has-A Links
      4.2.2.4 Generalization of Refer-To Links
    4.2.3 Constraint Generation Phase
  4.3 Time Complexity Analysis of Abstraction Induction Technique
  4.4 Evaluation of Abstraction Induction Technique
5. Construct Equivalence Assertion Language
  5.1 Design Principles for a Construct Equivalence Representation
  5.2 Development of Construct Equivalence Assertion Language
    5.2.1 High-Level Syntax Structure of Construct Equivalence Assertion Language
    5.2.2 Definition of Construct-Set
    5.2.3 Definition of Construct-Correspondence
    5.2.4 Definition of Ancillary-Description
    5.2.5 Execution Semantics of Construct Equivalences
  5.3 Information Preserving Construct Equivalence Transformation Function
  5.4 Evaluation of Construct Equivalence Assertion Language against Design Principles
6. Construct-Equivalence-Based Schema Translation and Schema Normalization
  6.1 Construct Equivalence Transformation Method
  6.2 Construct Equivalence Reasoning Method
  6.3 Construct-Equivalence-Based Schema Translation
    6.3.1 Algorithm: Construct-Equivalence-Based Schema Translation
    6.3.2 Advantages of Construct-Equivalence-Based Schema Translation
  6.4 Construct-Equivalence-Based Schema Normalization
7. Contributions and Future Research
  7.1 Contributions
    7.1.1 Contributions to MDBS Research
    7.1.2 Contributions to Other Research Areas
  7.2 Future Research Directions
Appendices
  A. Relationships between Synthesized Taxonomy of Conflicts and Other Taxonomies
  B. Model Schema of a Relational Data Model
  C. Common Separators and Their Implications to Concept Hierarchy Creation in Abstraction Induction Technique
  D. Evaluation Study 1: Relational Model Schema Induced from a University Health Center Database
  E. Evaluation Study 2: Network Model Schema Induced from a Hypothetical Company Database
  F. Evaluation Study 3: Hierarchical Model Schema Induced from a Hypothetical Company Database
  G. Inter-Model Construct Equivalences between the EER and SOOER Models
  H.
  Intra-Model Construct Equivalences of the SOOER Model
References

LIST OF FIGURES

Figure 1.1: Environment of Multidatabase System
Figure 1.2: Reference Schema Architecture of an MDBS
Figure 1.3: Schema Management Issues on the Five-Level Schema Architecture
Figure 2.1: Schema Integration Process
Figure 2.2: Proposed Schema Integration Process
Figure 2.3: Non-Orthogonal Model Constructs of Data Model
Figure 2.4: Overlapped Semantic Spaces of Data Models T and S
Figure 2.5: Conceptual Flow of Construct-Equivalence-Based Schema Normalization
Figure 2.6: Source of Semantic Loss and Domain for Semantic Enhancement
Figure 2.7: Example of Intra-Model and Inter-Model Construct Equivalences
Figure 2.8: Conceptual Flow of Schema Translation Based on Intra-Model and Inter-Model Construct Equivalences
Figure 2.9: Architectural Framework for Construct-Equivalence-Based Methodology for Schema Translation and Schema Normalization
Figure 3.1: Modeling Paradigm
Figure 3.2: Metamodeling Paradigm
Figure 3.3: Schema Hierarchy in Metamodeling Paradigm
Figure 3.4: Extended Metamodeling Paradigm
Figure 3.5: Schema Hierarchy in Extended Metamodeling Paradigm
Figure 3.6: Graphical Notations of the SOOER Model Constructs
Figure 3.7: Metamodel Constructs in Metamodel Schema
Figure 3.8: Model Definition Language Based on the SOOER Metamodel
Figure 3.9: Model Schema of Relational Data Model
Figure 4.1: Example of Concept Graph
Figure 4.2: Steps of Concept Decomposition Phase
Figure 4.3: Concept Hierarchies (Examples 1, 2 and 3) after Concept Hierarchy Creation
Figure 4.4: Concept Graphs (Examples 1 and 2) after Concept Hierarchy Enhancement
Figure 4.5: Concept Graphs (Examples 1, 2 and 3) after Concept Hierarchy Enhancement
Figure 4.6: Concept Graphs (Examples 1.1 and 1.2) before Concept Graph Merging
Figure 4.7: Concept Graph (Example 1.1 with 1.2) after Concept Graph Merging
Figure 4.8: Concept Graphs (Examples 1 and 2) after Concept Graph Merging
Figure 4.9: Concept Graphs (Examples 1 and 2) after Concept Graph Pruning
Figure 4.10: High-Level View of Concept Generalization Phase
Figure 4.11: Example of Property Generalization Hierarchy
Figure 4.12: Concept Graphs (Examples 1 and 2) after Generalization of Property Nodes
Figure 4.13: Concept Graphs (Examples 1 and 2) and Model Schema after Generalization of Leaf Concept Nodes
Figure 4.14: Example of Sets of Similar Non-Leaf Concept Nodes
Figure 4.15: Similarity Graph for All Non-Leaf Concept Nodes of Examples 1 and 2
Figure 4.16: Concept Graphs (Examples 1 and 2) and Model Schema after Generalization of Non-Leaf Concept Nodes and Has-A Links
Figure 4.17: Model Schema after Generalization of Refer-To Links
Figure 4.18: Relational Model Schema Induced from University Health Center Database
Figure 4.19: Reference Network Model Schema
Figure 4.20: Network Model Schema Induced from Hypothetical Company Database
Figure 4.21: Reference Hierarchical Model Schema
Figure 4.22: Hierarchical Model Schema Induced from Hypothetical Company Database
Figure 5.1: Process of Transforming Construct Equivalence of Example 5
Figure 5.2: Construct Equivalence Transformed from Example 5
Figure 5.3: Process of Transforming Construct Equivalence of Example 6
Figure 5.4: Construct Equivalence Transformed from Example 6
Figure 5.5: Construct Equivalence Transformed from Example 5
Figure 5.6: Construct Equivalence Transformed from Example 5

LIST OF TABLES

Table 1.1: Challenges to Research Issues of MDBS
Table 1.2: Required Support for Different Types of Schema Evolution
Table 3.1: Functions for Function Expression in SOOER Constraint Specification
Table 3.2: Explicit Metamodel Constraints in Metamodel Schema
Table 3.3: Metamodel Semantics in Metamodel Schema
Table 4.1: Summary of Characteristics of Inductive Metamodeling Process
Table 4.2: Summary of Existing Inductive Learning Techniques
Table 4.3: Model Constraints (Examples 1 and 2) after Constraint Generation
Table 4.4: Time Complexity of Each Step in Abstraction Induction Technique
Table 4.5: Summary of Evaluation Study 1 (Relational Model Schema)
Table 4.6: Precision Rates of Abstraction Induction Technique in Evaluation Study 1
Table 4.7: Summary of Evaluation Study 2 (Network Model Schema)
Table 4.8: Summary of Evaluation Study 3 (Hierarchical Model Schema)
Table 4.9: Summary of Three Evaluation Studies
Table 5.1: Sources for Ancillary-Description when Exchanging Construct-Sets
Table 5.2: Summary of Execution Semantics of Construct Equivalence
Table 5.3: Restructuring Operations for Ancillary Clause and Construct-Correspondence
Table 5.4: Restructuring Operations for Complex Construct-Domains and Selection Clauses on LHS Construct-Instances (or Construct-Instance-Sets)
Table 5.5: Inversibility of Restructuring Operations of the Construct Equivalence Transformation Function
Table 5.6: Evaluation of Construct Equivalence Assertion Language

LIST OF ALGORITHMS

Algorithm 3.1: Algorithm of Metamodel Semantics Instantiation
Algorithm 4.1: Generalization of Non-Leaf Concept Nodes and Downward Has-A Links
Algorithm 4.2: Constraint Generalization Process
Algorithm 6.1: Inter-Model Construct Equivalence Transformation
Algorithm 6.2: Intra-Model Construct Equivalence Transformation for Source Data Model
Algorithm 6.3: Intra-Model Construct Equivalence Transformation for Target Data Model
Algorithm 6.4: Intra-Model Construct Equivalence Transformation for Schema Normalization
Algorithm 6.5: Intra-Model Construct Equivalence Reasoning Method
Algorithm 6.6: Conflict Resolution
Algorithm 6.7: Inter-Model Construct Equivalence Reasoning Method
Algorithm 6.8: Algorithm for Construct-Equivalence-Based Schema Translation
Algorithm 6.9: Algorithm for Construct-Equivalence-Based Schema Translation

ABSTRACT

Advances in networking and database technologies have made the concept of global information sharing possible. A rapidly growing number of applications require access to and manipulation of the data residing in multiple pre-existing database systems, which are usually autonomous and heterogeneous. A promising approach to the problems of interoperating multiple heterogeneous database systems is the construction of multidatabase systems. Among all of the research issues concerning multidatabase systems, schema management, which involves the management of various schemas at different levels in a dynamic environment, has been largely overlooked in previous research. The two most important research issues in schema management have been identified: schema translation and schema integration. The need for a declarative and extensible approach to schema translation and the support for schema integration are accentuated in a large-scale environment.
This dissertation presents a construct-equivalence-based methodology, based on the implications of the semantic characteristics of data models, for schema translation and schema integration. The research was undertaken for the purposes of 1) overcoming the methodological inadequacies of existing schema translation approaches and of the conventional schema integration process for large-scale MDBSs, 2) providing an integrated methodology for schema translation and schema normalization, whose similarity of problem formulation has not previously been recognized, and 3) inductively learning model schemas that provide a basis for declaratively specifying construct equivalences for schema translation and schema normalization. The methodology is based on a metamodel (the Synthesized Object-Oriented Entity-Relationship (SOOER) model), an inductive metamodeling approach (the Abstraction Induction Technique), a declarative construct equivalence representation (the Construct Equivalence Assertion Language, CEAL), and its associated transformation and reasoning methods. The results of evaluation studies showed that the Abstraction Induction Technique inductively learned satisfactory model schemas. CEAL's expressiveness and adequacy in meeting its design principles, the well-defined construct equivalence transformation and reasoning methods, and the advantages realized by construct-equivalence-based schema translation and schema normalization suggest that the construct-equivalence-based methodology is a promising approach for large-scale MDBSs.

CHAPTER 1
Introduction

1.1 Background

For historical reasons, and because different applications can be better supported by different types of database management systems (DBMSs) employing different data models and different query languages, a variety of heterogeneous and autonomous local database systems (LDBSs) are in use in organizations today [MY95]. Each LDBS manages data for its applications (called local applications) autonomously and usually is not accessible to applications that do not pertain to the LDBS. As organizations and users become more sophisticated, they increasingly demand both access to and the ability to manipulate data in multiple pre-existing heterogeneous and autonomous LDBSs without loss of local autonomy [BHP92]. This demand for global information sharing arises in two contexts: complementary databases and competing databases [FN93, Z94, ZSC95]. In the context of complementary databases within a single organization or across multiple organizations, a LDBS covering a subarea of a domain of interest is a complement of other LDBSs, each of which captures only a certain aspect of the domain. Information concerning the same reality in the domain is therefore scattered over different LDBSs. Examples of global information sharing in the complementary context include computer-integrated manufacturing (CIM) [USR93, WD92, DW91], health care [LWH94a, LWH94b], and office information environments [HM85]. Without the capability to access and manipulate data in all these complementary databases, complete information about the same reality is hard to obtain and global data consistency among LDBSs is difficult to achieve. On the other hand, in the context of competing databases, LDBSs are similar in content but serve competing business interests. Electronic commerce provides a typical example of global information sharing in the competing context [MYB87, BW95]. Competing databases once were considered proprietary by their owning organizations.
Now, however, organizations are increasingly willing to grant external users (e.g., customers, suppliers, etc.) access to part of their databases in order to maintain competitive advantages and broaden promotion channels. Consequently, global information sharing on competing databases provides a global information retrieval platform on which users can easily search for information in different databases or seek services provided by different organizations. Applications for global information sharing (called global applications) are required to access and manipulate data from multiple pre-existing heterogeneous and autonomous database systems. Without any global support or coordination, the development of such global applications requires global application developers to have comprehensive knowledge of the schemas managed by the LDBSs, to be able to resolve any schema or system heterogeneity, and to be able to use various database systems with different data models and query languages. The performance and effectiveness (e.g., quality of query results and global data consistency) of global applications therefore depend heavily on the extensive knowledge of global application developers. To facilitate the development of global applications, it has been suggested that the most viable and general solution is a multidatabase system (MDBS) [BHP92, K95]. An MDBS is a database system that resides unobtrusively on top of existing LDBSs and presents a single-database illusion to global applications [K95], as shown in Figure 1.1. Its unobtrusiveness means that an MDBS should not require any change to an existing LDBS or to any existing local applications. Moreover, an MDBS should not limit global applications only to retrieval operations. Both retrievals and updates on LDBSs from global applications should be supported by an MDBS.

[Figure 1.1: Environment of Multidatabase System. Global applications submit global transactions to the MDBS, which sends subtransactions over a communication network to the local database systems (LDBSs) and consolidates their results; local applications continue to submit local transactions directly to their own LDBSs.]

As shown in Figure 1.1, a global transaction in an MDBS environment is a transaction executed under the MDBS's control and is decomposed by the MDBS into a set of subtransactions, each of which is an ordinary local transaction from the point of view of the LDBS where the subtransaction is executed. Afterward, the execution results of the subtransactions are returned to and consolidated in the MDBS, which in turn returns the consolidated result to the requesting global application. At the same time, local applications are still preserved. Outside the control of the MDBS, local applications submit to their LDBSs local transactions which have access only to a single LDBS. Execution results of local transactions are returned to the requesting local applications directly by the LDBSs.

1.2 Overview of Multidatabase Systems

1.2.1 Characteristics of Multidatabase Systems

The first and most substantial characteristic of an MDBS is the autonomy of LDBSs. A classification of the types of autonomy in an MDBS, proposed by Veijalainen and Popescu-Zeletin [VP88] and later extended by Sheth and Larson [SL90], is summarized as follows.
1. Design autonomy: refers to the ability of a LDBS to choose its own design with respect to any matter, including 1) the data being managed, 2) the representation (data model, query language) and naming of data elements, 3) the conceptualization or semantic interpretation of data, 4) data constraints, and 5) the functionality and implementation (e.g., access methods, transaction management model, concurrency control and recovery method, etc.) of the local DBMS.
2. Communication autonomy: refers to the ability of a LDBS to decide whether to communicate with other LDBSs. A LDBS with communication autonomy is able to decide when and how it responds to a request from another LDBS.
3. Execution autonomy: refers to the ability of a LDBS to execute local transactions without interference from external transactions (i.e., subtransactions) submitted by an MDBS and to decide the order in which to execute external transactions. Thus, an MDBS cannot enforce an execution order on a LDBS with execution autonomy. Execution autonomy implies that a LDBS can abort any transaction (local or external) that does not meet its local constraints and that its local transactions are logically unaffected by its participation in an MDBS. Furthermore, the LDBS need not inform an MDBS of the order in which external transactions are executed or of the order of external transactions with respect to local transactions.
4. Association autonomy: implies that a LDBS has the ability to decide whether and how much to share its resources and functionality with others. This includes the ability of a LDBS to associate with or disassociate from an MDBS and the ability of a LDBS to participate in more than one MDBS with different degrees of sharing.

As a result of the design autonomy of LDBSs, the second characteristic of an MDBS is heterogeneity among participating LDBSs. Heterogeneities may occur at different levels:
1. Syntactic heterogeneity: is primarily caused by the fact that LDBSs may not be homogeneous with respect to their data models and query languages.
2. Semantic heterogeneity: refers to the same reality being represented differently in different LDBSs. This results partially from multiple equivalent representations for the same reality expressed in the same data model and partially from different perspectives of the LDBSs' designers and conflicts in application semantics [BLN86, BCN92, SP91, KK92].
3. System heterogeneity: is primarily caused by the fact that LDBSs may not be homogeneous with respect to their functionality and implementation. For instance, a LDBS may not be a first-class DBMS and may therefore lack important DBMS features [LOG92]. Different LDBSs may support different access methods in query execution. Furthermore, LDBSs may have different concurrency control and recovery methods, and may use different transaction management models.
4. Operating environment heterogeneity: LDBSs may reside on computer systems with different architectures and different computational speeds. Moreover, LDBSs may be connected via different types of networks supporting different communication protocols.

1.2.2 Objectives of Multidatabase Systems

The characteristics of local autonomy and heterogeneity of LDBSs in an MDBS environment present several development requirements. In particular, an MDBS must provide an environment that provides local autonomy preservation, transparency, and multi-database consistency assurance. Performance and extensibility requirements are also essential to the development of an MDBS.
1.
Local autonomy preservation: preserving the four types of autonomy of LDBSs. The MDBS must require absolutely no change to LDBSs (including data and DBMSs) and existing local applications. In other words, the MDBS must appear to any participating LDBS as just another local application and must introduce virtually no 21 change in the administration of any participating LDBS. Furthermore, the MDBS cannot enforce an association or disassociation decision on a LDBS. 2. Transparency provision: An MDBS needs to hide the above-mentioned heterogeneities from global applications. Four types of transparencies need to be provided: a) Syntactic heterogeneity transparency: obviates global applications from the need to know different data models and query languages employed by LDBSs. Two levels of syntactic heterogeneity transparency can be defined. At minimum, global applications should be able to access LDBSs in the data model and query language supported by the MDBS (called common data model and common query language). To support this essential requirement, schemas of LDBSs need to be translated into equivalent schemas in a common data model. Second, a very appealing feature of global applications is that they can choose to work with the MDBS using a data model and query language other than the common data model and query language; that is, the MDBS needs to present several interfaces in different data models and accessible by different query languages to global applications. b) Semantic and distribution heterogeneity transparency: Global applications are shielded from semantic heterogeneities among LDBSs by the MDBS. Moreover, the MDBS needs to prevent global applications from having to know locations of information, whether or not information is replicated, and how information is dispersed and can be combined from different LDBSs. To achieve this type of transparency, an integrated schema (or integrated schemas) which represents the 22 integration of LDBSs' schemas needs to be maintained at the MDBS level. Thus, by issuing global transactions through the integrated schema of the MDBS, global applications can access or manipulate the set of participating LDBSs as if they were a centralized database system. c) System transparency: Global applications need not be aware of functional and implementational differences of LDBSs. d) Operating environment transparency: The MDBS must shield global applications from the heterogeneous operating environments of LDBSs. 3. Multi-database consistency assurance: ensuring the correct and reliable execution of global retrieval and update transactions in the presence of local transactions. Both transactional integrity, which refers to the consistency of all local databases in the presence of concurrent global and local access, and semantic integrity, which refers to the consistency of all local databases with respect to integrity constraints, need to be ensured by the MDBS. However, maintaining these types of integrity in an MDBS environment is significantly more difficult than in a homogeneous distributed database environment. 4. Performance comparable to that of homogeneous distributed database systems: Hiding heterogeneities and preserving local autonomy of LDBSs considerably increase the performance overhead of the MDBS. To be practical, the performance of the MDBS should be at least comparable to that of a homogeneous distributed database system. This is a difficult yet an essential requirement for the MDBS. 5. 
High extensibility: The MDBS needs to be highly extensible with respect to its ability to accommodate the new association of a LDBS whose data model, query language, transaction management model and concurrency control method are possibly unknown to the MDBS, the disassociation of a participating LDBS, and the evolution of participating LDBSs. In other words, an MDBS needs to adapt to any of these changes with minimal effort.

1.2.3 Schema Architecture of Multidatabase Systems

The characteristics of design and association autonomy, heterogeneity and distribution of an MDBS make the three-level schema architecture, which is appropriate for describing the architecture of a centralized DBMS, inadequate for describing the architecture of an MDBS. Several reference schema architectures have been proposed to support these characteristics [B87, TBC87]. Most of them are similar to the five-level schema architecture described in [SL90]. The reference five-level schema architecture, as shown in Figure 1.2, consists of a local schema level, a component schema level, an export schema level, an integrated schema level, and a global external schema level. Moreover, to support the coexistence of global and local users in an MDBS environment, each LDBS usually will have a set of local external schemas defined on top of its local schema. Since local external schemas are mainly for local users, they are not included in the schema architecture of an MDBS.

[Figure 1.2: Reference Schema Architecture of an MDBS. Local schemas at the bottom are translated into component schemas; export schemas are defined on the component schemas; the export schemas are combined into the integrated schema, on which global external schemas are defined; local external schemas remain defined on the local schemas for local users.]

1. Local schema: A local schema is the conceptual schema of a LDBS. A local schema is expressed in the local data model of the local DBMS; hence different local schemas may be expressed in different data models.
2. Component schema: A component schema is derived by translating a local schema into the common data model of the MDBS. Two reasons for defining component schemas are (1) they describe the divergent local schemas using a single representation and (2) semantics that are missing in a local schema can be added to its component schema. The use of component schemas supports the design autonomy and heterogeneity features of an MDBS.
3. Export schema: Not all data of a LDBS may be available to an MDBS and its users. An export schema represents the subset of a component schema that is available to the MDBS. The purpose of defining export schemas is to facilitate control and management of association autonomy. The representation at the export schema level is the common data model of the MDBS.
4. Integrated schema: An integrated schema (sometimes called a global schema or a federated schema) is an integration of multiple export schemas. An integrated schema also includes the information on data distribution that is generated when export schemas are integrated. The integrated schema level supports the distribution and semantic heterogeneity features of an MDBS. The representation at the integrated schema level is the common data model of the MDBS. There may be multiple integrated schemas in an MDBS, one for each class of federation users performing a related set of activities.
5. Global external schema: A global external schema defines a schema for a global user/application or a class of global users/applications.
The representation at the global external schema level (called the external data model) can be the common data model of the MDBS or any data model preferred by global users/applications, as described in the objective of syntactic heterogeneity transparency provision. The schema architecture shown in Figure 1.2 allows only one export schema for each LDBS and one integrated schema for an MDBS. However, this is only a reference model. Depending on the assumptions and requirements of the MDBS being designed, other schema architectures degenerating or extending from this reference model can be employed. For example, if all data defined on the local schema of each LDBS fully participate in the MDBS, the export schema level can be excluded from the schema architecture; thus, the reference five-level schema architecture degenerates into a four-level schema architecture. On the other hand, the reference schema architecture may appear too restricted when LDBSs are required to have the flexibility to decide not only whether to participate in the MDBS but also which federations to join and what data to share in each federation; in that case, the integrated and export schema levels of the reference schema architecture need to be extended. In this schema architecture (for an MDBS with multiple federations), several export schemas can be defined on the component schema of a LDBS. Each of the export schemas can be associated with one or more federations. There exist multiple integrated schemas in the MDBS, each of which represents the conceptual view of a specific federation of users/applications. An integrated schema can be constructed from a set of export schemas, a set of integrated schemas, or a combination of them.

1.2.4 Research Issues in Multidatabase Systems

Based on the five-level reference schema architecture discussed above, an essential research issue in developing an MDBS is the development or selection of a common data model and common query language. In addition, the characteristics and objectives of an MDBS render several challenges to the management of and the operations in an MDBS. MDBS management concerns the development and evolution processes of an MDBS, while MDBS operation involves issues related to global transactions and query execution. Table 1.1 illustrates important research issues and their challenges in each of the categories outlined below.

Table 1.1: Challenges to Research Issues of MDBS
MDBS Management
- Schema Management. Autonomy: design and association autonomy. Heterogeneity: syntactic and semantic heterogeneity. Objectives: autonomy preservation, transparencies provision, and high extensibility.
- Negotiation. Autonomy: association autonomy. Objectives: autonomy preservation.
MDBS Operation
- Global Transaction Management. Autonomy: design and execution autonomy. Heterogeneity: system heterogeneity. Objectives: autonomy preservation, transparencies provision, multi-database consistency assurance, performance, and high extensibility.
- Multidatabase Query Optimization and Processing. Autonomy: communication autonomy. Heterogeneity: system and operating environment heterogeneity. Objectives: autonomy preservation, transparencies provision, and performance.

MDBS Management:
1. Schema management: In terms of the reference schema architecture shown in Figure 1.2, schema management refers to constructing and evolving schemas in the component, export, integrated and global external schema levels during the MDBS development and evolution processes.
Schema management involves such issues as schema translation, schema integration, schema definition, and schema evolution support. Detailed discussion on each of these schema management issues will be deferred until next subsection. 2. Negotiation: Negotiation which occurs between the MDBS and LDBS database administrators involves protocols for governing the specification of export schemas 28 and the declaration of authorization to global users on accessing export schemas [SL90, AB89]. The negotiation protocols need to facilitate the preservation of the association autonomy of LDBSs with the goal of satisfying the global users' requirements and minimizing authorization conflict. MDBS Operation: 1. Global transaction management: It is responsible for maintaining database consistency in the presence of local transactions while allowing concurrent global updates across multiple LDBSs. Global transaction management is much more complicated than distributed transaction management for homogeneous distributed database systems. The complication is introduced by 1) possibly different transaction management models and concurrency control methods employed by LDBSs, 2) execution autonomy preventing the MDBS from being able to enforce transaction execution orders on LDBSs and from knowing the serialization orders of transaction execution from LDBSs, 3) the coexistence of global and local transactions causing indirect conflicts of which the MDBS is unaware, and 4) performance requirements [GRS94, MRB92]. 2. Multidatabase query optimization and processing: It is responsible for decomposing a global query into subqueries, generating and determining the most efficient execution strategy among all plausible execution strategies, and translating each subquery into an equivalent query in a local query language understood by its executing LDBS. Due to the unavailability of statistics required by query optimization (constrained by communication autonomy), system heterogeneity (e.g., different access methods 29 supported by different LDBSs; thus, resulting performance difference) and operating environment heterogeneity (different computational speed of LDBSs and network speed connecting LDBSs), additional concerns need to be taken into account when developing the multidatabase query optimizer and processor for an MDBS [SL90, DKS92, LOG92, MY95]. Among the MDBS research issues mentioned above, schema management is critical to the development, evolution and operation of an MDBS. However, in the past research, schema management has not been addressed in its entirety. Thus, it deserves the more detailed discussion which will be provided in the next section. 1.3 Issues of Schema Management for Multidatabase Systems Figure 1.3 illustrates the main schema management issues considered in the five-level schema architecture of an MDBS. It is evident that the development of an MDBS is a bottom-up process which involves schema translation (translating each local schema into its corresponding component schema), schema definition (defining an export schema(s) for each component schema), schema integration (integrating all export schemas into an integrated schema), and schema definition/schema translation (defining and translating part of the integrated schema into a global external schema). 
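The bottom-up construction just described can also be pictured as a small pipeline. The sketch below is only illustrative (Python); the class and function names (Schema, translate, define_export, integrate) are hypothetical and do not come from this dissertation or from any particular MDBS prototype. It simply walks two local schemas through translation into a common data model, export definition, integration, and derivation of a global external schema.

```python
# Illustrative sketch of the bottom-up MDBS schema construction process.
# All names below are hypothetical; conflict detection/resolution and
# semantic enrichment are deliberately omitted.

from dataclasses import dataclass, field

@dataclass
class Schema:
    name: str
    data_model: str                      # e.g. "relational", "network", "SOOER"
    constructs: dict = field(default_factory=dict)

def translate(schema: Schema, target_model: str) -> Schema:
    """Schema translation: re-express a schema in another data model."""
    return Schema(schema.name, target_model, dict(schema.constructs))

def define_export(component: Schema, shared: set) -> Schema:
    """Schema definition: expose only the constructs the LDBS agrees to share."""
    visible = {k: v for k, v in component.constructs.items() if k in shared}
    return Schema(component.name + "_export", component.data_model, visible)

def integrate(exports: list) -> Schema:
    """Schema integration: superimpose export schemas into one integrated schema."""
    merged = {}
    for export in exports:
        merged.update(export.constructs)
    return Schema("integrated", exports[0].data_model, merged)

# Bottom-up construction for two local databases, with "SOOER" as the common data model:
local1 = Schema("patients_db", "relational", {"PATIENT": ["ssn", "name"]})
local2 = Schema("billing_db", "network", {"INVOICE": ["no", "amount"]})
exports = [define_export(translate(s, "SOOER"), set(s.constructs)) for s in (local1, local2)]
integrated = integrate(exports)
global_external = translate(integrated, "relational")   # or any other external data model
```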
[Figure 1.3: Schema Management Issues on the Five-Level Schema Architecture. The figure places schema translation between the local and component schema levels, schema definition between the component and export schema levels, schema integration between the export and integrated schema levels, and schema definition/translation between the integrated and global external schema levels, with changes propagating across the levels.]

1. Schema translation hides the syntactic heterogeneity in an MDBS and is needed in two situations: between the local and component schema levels, and between the integrated and global external schema levels. Since the local data model of a LDBS is usually not as expressive as the common data model, some data semantics are missing in the local schema of the LDBS and are often embedded in the extension of the local schema (i.e., the local database). Therefore, the schema translation from a local schema to its corresponding component schema involves a schema enrichment process to improve the semantic level of the component schema. Schema enrichment discovers missing semantics from local databases and reduces the difficulty of schema integration, which can make use of these additional semantics to detect and resolve semantic heterogeneity more easily.
2. Schema integration is required to provide semantic and distribution heterogeneity transparency. It refers to integrating multiple schemas (export, component and/or integrated schemas, depending on the schema architecture of the MDBS) into a single integrated schema by identifying and resolving the semantic heterogeneity existing between/among the schemas to be integrated.
3. Schema definition provides a facility to support the specification of an export schema from a component schema, or of a global external schema from an integrated schema, of an MDBS. The schema definition facility requires a view mechanism in the common query language of the MDBS, similar to that used for specifying external schemas in centralized database systems.
4. Schema evolution support deals with the propagation of changes in any schema level of the schema architecture to the other affected schema levels in a dynamic MDBS environment. Schema evolution may be caused by 1) association of a LDBS with the MDBS, 2) disassociation of a LDBS from the MDBS, 3) changes to the local or export schema of a LDBS, 4) changes to a global external schema, or 5) creation of a new global external schema. Support for different types of schema evolution involves different combinations and sequences of schema translation, schema integration, schema definition, and housekeeping operations. The housekeeping operations are used for updating the affected schema(s) in the schema architecture without involving schema translation, integration, or definition. Table 1.2 shows the required support for each type of schema evolution.

[Table 1.2: Required Support for Different Types of Schema Evolution. For each type of schema evolution (new association, disassociation, changes to a local or export schema, changes to a global external schema, and adding a new global external schema), the table marks which of schema translation, schema integration, schema definition, and housekeeping operations are required.]

1.4 Research Motivation and Objective

Tremendous advances in networking, the exponential growth of databases, and the ever-increasing need for global information sharing will bring an MDBS into a large-scale environment. It has been recognized that finding solutions to interoperability in a large network of databases will be a major research area in the next decade or so [BKG93, SSU91, SYE90].
"Large-scale" is a relative term. In terms of schema management issues, it is not defined by the number and geographical dispersion of LDBSs participating in an MDBS. Rather, it is characterized by 1) the magnitude and variation of local data models employed by LDBSs being associated with an MDBS, 2) complicated semantic heterogeneity of participating LDBSs, and 3) dynamic changes of schemas, participation and revocation of LDBSs. A more detailed elaboration of each characteristic and its challenges to schema management in a large-scale MDBS environment are as follows: 33 1. In a large-scale MDBS environment, the assumption made in some prototype MDBSs [BOT86, TBC87, ADD91, ADK91, ADSY93] that the MDBS supports only a given set of local data models is inappropriate, because other data models (proprietary, variations of a data model, or possibly newly emerged) may be adopted as local data models of LDBSs. Moreover, the number of external data models needed to be supported also increases in a large-scale MDBS environment. The magnitude and variation of local and external data models impose a fundamental requirement that schema translation must be declarative and extensible to easily accommodate various, possibly unseen, local and external data models. 2. The semantic heterogeneity in a large-scale MDBS environment is more complicated than that in a smaller-scale MDBS environment. Increased and more complicated semantic heterogeneity calls for a schema integration process that reduces interactions with database administrators during an extremely difficult schema integration process. This can be approached in two ways. One is to reduce the amount and complexity of the semantic heterogeneity which needs to be identified and resolved by means of transforming schemas that make them semantically equivalent but more representationally compatible. The other is to adopt some form of intelligent support to discovering data semantics buried in local databases so that the need for interaction with database administrators will be lessened. 3. A large-scale MDBS envirorunent is dynamic because I) new LDBSs may join into the MDBS, 2) LDBSs already participating in the MDBS may cease their participation in the MDBS, and 3) local or export schemas of LDBSs may change over time. Thus, schema evolution support will become a de facto requirement. As 34 shown in Table 1.2, schema translation, schema integration and schema definition are often needed in different types of schema evolution. Facility for schema definition should be invariant, regardless of the scale of the MDBS, because it is simply a view mechanism on the common query language of an MDBS. However, increased requests for the schema evolution support in a dynamic environment amplify the challenges to schema translation and schema integration mentioned above. The first approach to facilitating the schema integration process is to normalize schemas based on some pre-defined normalization criteria. Moreover, this can also be regarded as a special type of schema translation. In other words, transforming a schema is to translate the schema into another schema expressed in the same data model as the original one. Thus, it is achievable to develop a technique suitable to solving th^ two most important schema management issues: schema translation and schema normalization for facilitating the schema integration process. Past research on schema translation focused mostly on the development of model-to-model translation rules. 
A declarative and model-independent approach for schema translation has not yet been proposed in the literature. On the other hand, schema translation and the support for schema integration have been treated as independent research issues in previous research. The overlapping between schema translation and support for schema integration has failed to be identified and utilized in providing a satisfactory solution to these two issues. This dissertation research was motivated by the growing trend of moving an MDBS into a large-scale environment and was directed toward developing an integrated 35 methodology to meet the challenges to schema translation and the support to schema integration that exist in such an environment. 1.5 Organization of the Dissertation This chapter presented an overview of the characteristics, objectives, reference schema architecture, major research issues of an MDBS in general, and the schema management for an MDBS in particular. The motivation and objective of this dissertation research also have been discussed. In Chapter 2, the literature relevant to this dissertation research will first be reviewed, followed by the formulation of research approaches, tasks and framework. Detailed research questions to be addressed will also be defined. Chapter 3 will depict the metamodeling paradigm as well as the development of a metamodel. Synthesized Object-Oriented Entity-Relationship (SOOER) model, which can also be adopted as a common data model of an MDBS. Chapter 4 will present an inductive metamodeling technique. Abstraction Induction Technique, which induces a model schema from a set of application schemas expressed in a data model. The model schema serves as a foundation for construct-equivalence-based schema translation and schema normalization. Evaluation of Abstraction Induction Technique for inductive metamodeling will also be conducted and analyzed in this chapter. In Chapter 5, the detailed language, Construct Equivalence Assertion Language (CEAL), for the constructequivalence-based schema translation and schema normalization will be presented. Chapter 6 will be devoted to the development of algorithms for construct equivalence transformation and reasoning methods as well as the construct-equivalence-based schema translation and schema normalization. Finally, Chapter 7 will summarize contributions of this dissertation research as well as suggest future research directions. 37 CHAPTER 2 Literature Review and Research Formulation This chapter reviews the literature related to this dissertation research, focusing on the formulation of and approaches to schema translation and integration problems. Its aun is to identify common areas between these two schema management issues and adequacies as well as inadequacies of existing approaches. Based on the literature review, the foundation of this dissertation research will be analyzed as a preliminary to establishing the framework for an integrated methodology for schema translation and support to schema integration that is needed in a large-scale MDBS environment. Research questions specific to this methodology to be addressed will also be detailed. 2.1 Literature Review on Model and Schema Translation 2.1.1 Formulation of Model and Schema Translation Problems (Data) model translation and schema translation are often used interchangeably. However, as implied by their names, the former deals with translation at the data model level, while the latter deals with the translation at the schema level. 
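Before turning to the formal problem statements below, a minimal sketch may help fix this distinction. The Python fragment is purely illustrative (the function names and dictionary layout are hypothetical, not a notation used in this dissertation): it encodes one piece of model-level translation knowledge, the familiar rule that a multi-valued entity-relationship attribute becomes a separate relation holding that attribute together with the identifier of its owning entity, and then applies that knowledge to one particular schema. Defining the rule is model translation; applying it to a given application schema is schema translation.

```python
# Hypothetical sketch: one ER-to-relational translation rule (model translation
# knowledge) and its application to an application schema (schema translation).

def multivalued_attr_to_relation(entity: dict, attr: str) -> dict:
    """Model-level rule: a multi-valued ER attribute becomes its own relation,
    holding the attribute (now single-valued) plus the owning entity's identifier."""
    return {"relation": f"{entity['name']}_{attr}",
            "attributes": entity["identifier"] + [attr]}

def translate_schema(er_schema: list) -> list:
    """Schema translation: apply the model-level knowledge to a particular schema
    (note that this application is directional, from ER to relational)."""
    relations = []
    for entity in er_schema:
        relations.append({"relation": entity["name"],
                          "attributes": entity["identifier"] + entity["single_valued"]})
        for attr in entity["multi_valued"]:
            relations.append(multivalued_attr_to_relation(entity, attr))
    return relations

# A toy ER schema with one entity and one multi-valued attribute:
er_schema = [{"name": "EMPLOYEE", "identifier": ["emp_id"],
              "single_valued": ["name"], "multi_valued": ["phone"]}]
print(translate_schema(er_schema))
# Yields EMPLOYEE(emp_id, name) and EMPLOYEE_phone(emp_id, phone).
```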
The problem of model translation is usually formulated as:
- Given two data models,
- Define the translation knowledge between the model constructs of one data model and the model constructs of the other data model.

On the other hand, the problem of schema translation is formulated as:
- Given 1) an application schema represented in a data model (called a source data model), and 2) translation knowledge from the source data model to another data model (called a target data model),
- Derive an equivalent application schema represented in the target data model.

Model translation deals with the specification of translation knowledge between data models. Based on the semantics of the model constructs of two data models, the translation knowledge defines equivalences between the model constructs of one data model and those of the other. The detailed properties of model constructs may also need to be specified in the translation knowledge. For example, an attribute whose multiplicity property is 'multi-valued' in an entity-relationship model can be translated into a relation in a relational model. The relation includes an attribute which is the same as the multi-valued attribute but now has the multiplicity property 'single-valued', as well as the identifier attribute(s) of the entity on which the multi-valued attribute is defined. In this example, the translation knowledge between the entity-relationship and relational models is defined not only at the model construct level (i.e., attribute and relation) but also at the level of properties of model constructs (i.e., the multiplicity property of an attribute). Moreover, the translation knowledge between two data models should ideally be bi-directional. That is, the same set of translation knowledge can be used for translating from one data model to another and vice versa. Schema translation is the process of applying translation knowledge on an application schema expressed in the source data model to generate an equivalent application schema
The 40 translation knowledge is usually represented as a set of translation rules which are always directed from a source data model to a target data model. Several disadvantages are associated with the direct-translation approach. First and most importantly, it is done on a data model by data model basis rather than at a general level. Thus, no formal framework is defined and employed to guide the process of specifying translation knowledge between two data models. Second, it lacks formal support for specifying the translation knowledge (rules) from one data model to another. Because the model constructs and their semantics are not explicitly formalized, model engineers are required to have comprehensive understanding of the types, relationships and semantics of model constructs of both data models. Third, the translation knowledge is specified from one data model to another data model; it is uni-directional. Thus, the inverse model translation always requires effort by model engineers to specify another set of translation knowledge. Finally, although the translation knowledge can be represented as a set of translation rules, without formally formalized model constructs, the development of a formal language as the representation of translation knowledge universal to any two data models is hard to achieve. Thus, the translation knowledge is usually implemented procedurally rather than declaratively. The above-mentioned disadvantages hamper use of the direct-translation approach for the MDBS development or evolution process. To overcome the problems inherent in the direct-translation approach, a metamodeltranslation approach has been proposed [AT91, AT93, JJ95]. The notion of 41 metamodels and metamodeling has been adopted in the design and development of information systems [BF92, BdL89, H092] as well as computer-aided software engineering (CASE) [GCK92]. A metamodel is a model one level higher than a model and provides a set of constructs (called metamodel constructs) for formally formalizing models [vG87]. The formal specification of a model expressed in a metamodel is called the model schema of the model, while the process of conceptualization of a model is called the metamodeling process. A detailed description of metamodel and metamodeling paradigm will be provided in the next chapter. The general process in the metamodel-translation approach consists of the following two steps: 1) the model schema for each data model to be translated is formally represented in a metamodel and 2) the translation knowledge is then specified on the two model schemas. In work by Atzeni and Torlone [AT91, AT93], the metamodel employed consists of four metamodel constructs; lexical (atomic concept), abstract (abstract concept or type), aggregation (relationship between/among abstracts) and function (mapping between a concept, which can be a lexical, abstract or aggregation, and another concept). Once the model schemas for the source and target data models are represented in this metamodel, the translation knowledge is then specified on the model schemas procedurally by Pascal-like programs. Their metamodel-translation method overcomes the first two problems of the direct-translation approach. However, their method has the same disadvantages as the last two disadvantages in the direct-translation approach (i.e., uni-directional and procedural translation knowledge). Moreover, the metamodel used in this method allows modeling a data model only at the model construct level. 
Jeusfeld and Johnen [JJ95] proposed another method based on the metamodel-translation approach. The metamodel adopted is a specialization hierarchy of concepts. The most general concept is an element, which can be specialized into either a unit or a link. A unit may consist of other units and may be connected by links. A unit can be specialized into an object unit, which may have instances in the database, or a type unit, which does not have explicit instances in the database. Links, on the other hand, are distinguished according to their arity and direction. Thus, a link can be an undirected link, directed link, binary link, total link, or partial link. These subclasses of units and links can be further specialized. Since the metamodel adopted consists of a large number of metamodel constructs (represented as concepts in the specialization hierarchy), the model schemas of the source and target data models imply some general translation knowledge between the two data models. For example, if a model construct of the source data model and a model construct of the target data model are instances of the same metamodel construct, they are regarded as being equivalent. However, the instantiation from a model construct into a metamodel construct is not always one-to-one. In this case, a set of first-order-logic-like rules needs to be defined to classify the model construct into several cases, each of which is an instance of one metamodel construct. This set of first-order-logic-like rules constitutes another part of the translation knowledge. To handle situations in which a model construct in the source data model has no corresponding model construct in the target data model that is an instance of the same metamodel construct, or in which a model construct in the source data model has more than one corresponding model construct in the target data model, a set of mapping relationships (again expressed as first-order-logic-like rules) between the source and target data models is used. This set of mapping relationships is independent of specific data models and is reusable for translation between any two data models. Whether the translation knowledge specified in this method is bi-directional is not explicitly stated in [JJ95]; thus, it is hard to judge whether this method overcomes the third disadvantage of the direct-translation approach. Although the translation knowledge is declaratively represented in first-order-logic-like rules, such a representation may not be easy to use. Moreover, as in the previous method, the detailed properties of each model construct of a data model cannot be represented in this metamodel.

In sum, the goal of model translation is to develop the translation knowledge (i.e., equivalences between the model constructs of two data models) based on the semantics of the model constructs. The direct-translation approach to model translation is not suitable in an MDBS environment. The metamodel-translation approach has been shown to be appropriate for schema translation in the MDBS context. Due to the problems exhibited by the two methods using the metamodel-translation approach reviewed above [AT91, AT93, JJ95], a new schema translation method based on the metamodel-translation approach is needed.
2.2 Literature Review on Schema Integration

2.2.1 Formulation of Schema Integration Problem

In some of the literature, the schema integration problem encompasses the schema translation process [WCN92, SPD92, SP91], while in other literature the schema translation process is assumed to have been completed before schema integration begins [BLN86]. Since schema translation is a separate schema management issue, as depicted in the previous chapter, the second view of the schema integration problem is adopted in this dissertation. Based on the five-level schema architecture shown in Figure 1.2, the schema integration problem is formulated as follows:
- Given a collection of export schemas,
- Construct an integrated schema that will support all of the export schemas. The construction of the integrated schema should not result in loss of information or of the ability to query and/or update LDBSs either individually or collectively.

The schema integration processes proposed in the literature [BLN86, KK92, WCN92, SPD92] generally consist of three main steps: conflict identification, conflict resolution, and schema merging and restructuring. Figure 2.1 illustrates the conventional schema integration process.

Figure 2.1: Schema Integration Process

1. Conflict Identification: Input schemas are analyzed and compared to detect possible conflicts, represented as a set of inter-schema correspondences. In addition, inter-schema relationships, which are relationships between/among concepts in different schemas that do not exist in any individual schema, may be discovered while comparing schemas.
2. Conflict Resolution: Each type of inter-schema correspondence may be associated with multiple resolution strategies. It is the responsibility of this step to determine an appropriate resolution strategy for each inter-schema correspondence. During this step, interaction with designers and users is usually required before compromises can be achieved.
3. Schema Merging and Restructuring: After resolution strategies for inter-schema correspondences have been determined and inter-schema relationships have been collected, the schemas are ready to be superimposed, giving rise to an integrated schema which is then analyzed and, if necessary, restructured in order to achieve such desirable qualities as completeness, correctness, minimality, and understandability.

The main methodological issues are related to the problem of discovering and resolving possible conflicts in the schemas to be integrated [SBD93]. Conflicts between/among schemas are mainly caused by differing perspectives of the designers of the schemas to be integrated, equivalences among model constructs in the data model, conflicts in application semantics, etc. [BLN86, BCN92, SP91, KK92]. In the next subsection, a taxonomy of conflicts is presented.

2.2.2 Taxonomy of Conflicts

Several taxonomies of conflicts have been proposed in the literature [SP91, BCN92, SPD92, KS94, RPR94]; a synthesized taxonomy of conflicts is given below. The relationships between the synthesized taxonomy and the other taxonomies found in the literature are listed in Appendix A.

1. Semantic Conflict: Different designers may not perceive exactly the same sets of objects or may adopt different classifications.
Semantic conflict indicates that inter-schema relationships are missing from individual export schemas [YP93, SP91, SPD92].
2. Descriptive Conflict: When describing related sets of real-world objects, different designers do not perceive exactly the same set of properties. Descriptive conflicts include naming conflicts due to homonyms and synonyms, as well as conflicts in attribute domain, scale, constraints, operations, etc. [SP91, SPD92].
3. Structural Conflict: Designers can choose different model constructs to represent the same real-world objects. The extent to which structural conflicts may arise is related to the semantic relativism of the data model in use (i.e., to its ability to support different, although equivalent, representations of the same reality) [SP91, SPD92].
4. Extension Conflict: Data for related concepts in different LDBSs may not be compatible. The incompatibility arises from the different levels of accuracy required by different LDBSs, asynchronous updates among LDBSs, etc. [RPR94].
5. Schematic Conflict: This type of conflict arises when data in one LDBS correspond to metadata of another LDBS [SP91, SPD92, SCG93].

2.2.3 Analysis of Conflict Types and Schema Integration Process

Among the five types of conflicts defined in the previous subsection, the extension conflict is the only one occurring at the data level rather than the schema level. It complicates operations in an MDBS (e.g., how can inconsistent data from different LDBSs be reconciled into a global query result?). However, conflict identification and resolution in the schema integration process is not concerned with this type of conflict. The identification and resolution of semantic, descriptive, and schematic conflicts mainly depend on understanding the semantics of export schemas and usually require the use of their extensions. On the other hand, since structural conflict results from the semantic relativism of the data model in use, understanding the semantics of the model constructs of the data model and the equivalences among those model constructs is essential to identifying and resolving this type of conflict. Because the nature of structural conflict and the knowledge required to identify and resolve it differ from those of the other three conflict types, it should be treated differently in the schema integration process. If equivalences among the model constructs of the data model (called construct equivalences) and an associated reasoning mechanism exist, each input schema can be transformed into a semantically equivalent and representationally compatible schema. Thus, most (if not all) structural conflicts among the input schemas can be eliminated from the transformed input schemas. The process of transforming each input schema into such a semantically equivalent schema is called schema normalization, and the resulting schema is called the normalized schema. The complexity of the conflict identification and resolution steps is thereby mitigated because the magnitude of the structural conflicts has previously been reduced, if not fully eliminated. Based on this analysis, a new schema integration process is proposed, as shown in Figure 2.2. In this proposed schema integration process, the first step is schema normalization, which employs a set of construct equivalences and is guided by normalization criteria to transform each input schema into a normalized schema. A normalization criterion is an undesired (or desired) quality that must be avoided (or achieved) by a schema.
An undesired quality can be defined as an expression of an undesired model construct or an undesired combination of model constructs. After the schema normalization step, the normalized schemas replace the original input schemas for the subsequent steps of the schema integration process.

Figure 2.2: Proposed Schema Integration Process

2.3 Research Foundation

As stated in the research motivation in Section 1.4, the objective of this dissertation research has been to develop an integrated methodology addressing the challenges to schema translation and to the support of schema integration that are accentuated in a large-scale MDBS environment. The literature review on schema (model) translation and schema integration indicates that construct equivalences (i.e., equivalences among model constructs) of two data models or of the same data model are essential to model translation and to schema normalization in the proposed schema integration process. Thus, the notion of construct equivalences establishes the foundation for this dissertation research. In this section, an analysis of construct equivalences based on the semantic characteristics of data models is first conducted. Its implications for schema normalization and model translation are then discussed.

2.3.1 Semantic Characteristics of Data Models

Non-orthogonal model constructs: A data model provides a set of model constructs for representing the data semantics (structural, behavioral, and constraint) of application domains. The semantics of this set of model constructs constitute the semantic space of the data model [SZ93]. As has been mentioned, when using a data model it is possible that the same reality can be modeled by equivalent representations (i.e., with different combinations of model constructs) [BLN86, BCN92, SP94]. The phenomenon of multiple possible representations of the same real world is called semantic relativism [SP94, SP91, SPD92] and results from the non-orthogonal constructs supported by the data model. A set of model constructs M is orthogonal if the semantics of any proper subset of M is inexpressible by any other proper subset of M. Conversely, a set of model constructs N is non-orthogonal if a model construct (or model constructs) in N can equivalently be expressed by other model construct(s) in N. Figure 2.3 graphically represents the semantic space of a data model and the non-orthogonal constructs it supports. For example, suppose the model constructs of data model S are Cs1, Cs2, ..., Csk. Cs1, Cs2, and Csi are not orthogonal, since the union of the semantics of Cs1 and Cs2 is equivalent to part of the semantics of Csi. Thus, there exists a construct equivalence between the union of Cs1 and Cs2 and part of Csi. On the other hand, a construct equivalence exists between Cs3 and the union of Csj and Csk, because the semantics of Cs3 is equivalent to the union of the semantics of Csj and Csk. It is evident that the set of model constructs Cs1, Cs2, and Cs3, or the set of model constructs Cs1, Cs2, and Csj, is orthogonal, since no proper subset of either set is expressible by any other proper subset of the same set.
Figure 2.3: Non-orthogonal Model Constructs of a Data Model

The non-orthogonality of model constructs in a data model can be represented as a set of construct equivalences within the data model (called intra-model construct equivalences). Furthermore, these intra-model construct equivalences are bi-directional. In other words, an intra-model construct equivalence specifies that a set of model construct(s) is equivalent to another set of model construct(s) in the same data model, and vice versa.

Overlapped semantic spaces: As mentioned, each data model occupies a semantic space, and different data models occupy semantic spaces of different sizes. The semantic space of one data model overlaps with that of another data model if there exist equivalence relationships among model constructs of the two data models. In other words, the semantics represented by the overlapping model constructs of one data model can also be expressed by their equivalent counterparts in the other data model. As shown in Figure 2.4, the semantic spaces of data models S and T overlap in terms of the constructs Csi, Csj, ... of data model S and Ctm, Ctn, Cto, ... of data model T. For example, part of the semantics of Csi in data model S is equivalent to the semantics of Ctm in data model T, while the semantics of Csj in data model S is equivalent to the union of the semantics of Ctn and Cto in data model T.

Figure 2.4: Overlapped Semantic Spaces of Data Models T and S

The notion of overlapped semantic spaces provides a sound foundation for defining translatability between data models. Two data models are translatable if and only if their semantic spaces overlap; otherwise, they are not translatable. The overlapped semantic spaces of two data models can be represented as a set of construct equivalences between the two data models (called inter-model construct equivalences). Like intra-model construct equivalences, inter-model construct equivalences are bi-directional.

2.3.2 Implication to Schema Normalization

Since schema normalization is a special type of schema translation, the implications of the semantic characteristics of a data model for schema normalization are discussed first.

Schema Normalization Problem Formulation: The non-orthogonality of model constructs in a data model is what makes schema normalization possible. Accordingly, the schema normalization problem can be formally formulated as:
- Given 1) a set of intra-model construct equivalences of a data model S, 2) normalization criteria represented as a set of undesired model constructs, and 3) an application schema represented in S,
- Derive a normalized application schema represented in S which satisfies the normalization criteria.

Conceptual Flow of Construct-Equivalence-Based Schema Normalization: Since the schema normalization process is based on the notion of intra-model construct equivalences, it is called the construct-equivalence-based schema normalization approach. A minimal procedural sketch of the approach is given below, followed by a worked example.
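The sketch below (in Python, for illustration only; the rule format and function names are assumptions rather than the dissertation's formal representation) restates the formulation above: each bi-directional equivalence is applied in the direction that leads away from an undesired construct, until no instance of an undesired construct remains.

    # Construct-equivalence-based schema normalization, as a minimal sketch.
    # An equivalence is stored bi-directionally as two construct sets plus one
    # rewrite function per direction; the normalization criteria (undesired
    # constructs) decide which direction is applied.
    def normalize(schema, equivalences, undesired):
        # schema: list of (construct, instance) pairs
        # equivalences: list of (constructs_a, constructs_b, rewrite_a_to_b, rewrite_b_to_a)
        # undesired: set of construct names that must not appear in the result
        changed = True
        while changed:
            changed = False
            for a, b, a_to_b, b_to_a in equivalences:
                if set(a) & undesired and not (set(b) & undesired):
                    direction, rewrite = a, a_to_b        # apply a -> b
                elif set(b) & undesired and not (set(a) & undesired):
                    direction, rewrite = b, b_to_a        # apply b -> a
                else:
                    continue
                for construct, instance in list(schema):
                    if construct in direction:
                        schema.remove((construct, instance))
                        schema.extend(rewrite(instance))  # equivalent instances of desired constructs
                        changed = True
        return schema

Applied to the example that follows, the equivalence between Cs3 and the union of Csj and Csk would be used in the direction from Cs3 to Csj and Csk, eliminating every instance of Cs3 from the schema.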
For example, given the set of intra-model construct equivalences of data model S shown in Figure 2.3, with Cs3 being the undesired model construct in the normalization criteria, the semantic space of data model S is transformed accordingly, as shown in Figure 2.5. Since Cs3 is the undesired model construct in this normalization task, the bi-directional intra-model construct equivalence between Cs3 and the union of Csj and Csk is turned into a uni-directional equivalence from Cs3 to the union of Csj and Csk. The schema normalization process then normalizes an application schema by following the direction of this intra-model construct equivalence, transforming all instances of the model construct Cs3 in the application schema into equivalent instances of the model constructs Csj and Csk. The resulting application schema will not contain any instance of the model construct Cs3; thus, it satisfies the normalization criteria specified previously.

Figure 2.5: Conceptual Flow of Construct-Equivalence-Based Schema Normalization

2.3.3 Implication to Model and Schema Translation

Model and Schema Translation Problem Formulation: As mentioned, the overlapped semantic spaces of two data models present an opportunity for translation between the two data models and provide direct mappings for it. However, the non-overlapped semantic spaces of the two data models pose two challenges to model translation: semantic loss and the need for semantic enhancement. In the context of schema translation, semantic loss refers to a situation in which semantics covered in the source application schema cannot be fully expressed by the model constructs of the target data model, thus resulting in the loss of part of the semantics of the source application schema. Semantic enhancement, on the other hand, enriches the target application schema by discovering additional semantics which are neither expressible in the source data model nor explicitly expressed in the source application schema. As depicted in Figure 2.6, the part of the semantic space of the source data model S that does not overlap with that of the target data model T is the cause of semantic loss, since model constructs in this non-overlapped semantic space may not be directly representable in the target data model. For example, there exists no model construct in the target data model T that is directly equivalent to the model construct Cs1 of the source data model. As a result, any semantics in the source application schema represented by Cs1 may not be expressible in the target data model, and this part of the semantics of the application schema will be lost when the application schema represented in data model S is translated into data model T. Therefore, one objective of model translation is to provide a systematic way of minimizing the semantic loss that results from the non-overlapped semantic space of the source data model.
Figure 2.6: Source of Semantic Loss and Domain for Semantic Enhancement

On the other hand, the part of the semantic space of the target data model that does not overlap with that of the source data model defines the domain for semantic enhancement, since the semantics expressible by model constructs in this non-overlapped semantic space of the target data model are not readily available in any model construct of the source data model. For example, as shown in Figure 2.6, the semantics of the model constructs Ct1, Ct2, and Ct3 in the target data model T are not in the semantic space overlapped with that of the source data model S. When a source application schema represented in data model S is translated into one represented in data model T, these model constructs will not be present in the target application schema, since the semantics covered by these model constructs are not available in the source application schema. Thus, the second objective of model translation is to maximize the semantic enhancement result in a systematic manner.

To achieve these two objectives of model translation, the intra-model construct equivalences of both data models need to be employed in schema translation, in addition to the inter-model construct equivalences between the source and target data models. Obviously, the inter-model construct equivalences serve as the translation bridge between the two data models. Moreover, the intra-model construct equivalences of the two data models can be used to minimize semantic loss and maximize the semantic enhancement result during a translation. For example, as shown in Figure 2.6, there exists no construct in the target data model which is equivalent to Cs1 or Cs2. However, if there exists an equivalence between the union of the semantics of Cs1 and Cs2 and part of the semantics of Csi (as shown in Figure 2.3), it can be concluded that the semantics covered by Cs1 and Cs2 in a source application schema is representable in the target data model (e.g., by the construct Ctm) and can be contained in the corresponding target application schema. Thus, through the use of the intra-model construct equivalences of the source data model, semantic loss can be minimized. On the other hand, if there exists in the target data model an equivalence between some constructs in the non-overlapped semantic space and some other constructs in the overlapped semantic space, this intra-model construct equivalence can be employed to guide the semantic enhancement process and to increase the semantic level of the target application schema by discovering semantics which are missing from the source application schema but are representable by those constructs in the non-overlapped semantic space of the target data model.

Essentially, construct equivalence serves as the representation of translation knowledge; the inter-model and intra-model construct equivalences are the translation knowledge. Model translation based on this notion is called the construct-equivalence-based model translation approach.
Hence, the problem of model translation is formally formulated as:
- Given 1) the semantic space of a data model S, and 2) the semantic space of a data model T,
- Derive 1) the intra-model construct equivalences of S, 2) the intra-model construct equivalences of T, and 3) the inter-model construct equivalences between S and T.

The formulation of the schema translation problem then becomes:
- Given 1) the intra-model construct equivalences of the two data models (S and T), 2) the inter-model construct equivalences between S and T, and 3) an application schema represented in S,
- Derive an application schema represented in T.

Conceptual Flow of Construct-Equivalence-Based Schema Translation: The construct-equivalence-based schema translation approach is straightforward and systematic, as will be seen shortly. Assume that the intra-model construct equivalences of the two data models (S and T) and the inter-model construct equivalences between S and T are as shown in Figure 2.7. The schema translation from data model S (source) to data model T (target) is shown in Figure 2.8. Conceptually, the construct-equivalence-based schema translation consists of three stages:
1. the source convergence stage, in which all instances (in the source application schema) of the model constructs in the non-overlapped semantic space of S are transformed into instances (in the source application schema) of the model constructs of S in the overlapped semantic space;
2. the source-target projection stage, in which all instances (in the source application schema) of the model constructs of S in the overlapped semantic space are mapped into instances (in the target application schema) of the model constructs of T in the overlapped semantic space; and
3. the target enhancement stage, in which some instances (in the target application schema) of the model constructs of T in the overlapped semantic space are considered for semantic enhancement by transforming them into instances of other model constructs, including those in the non-overlapped semantic space of T.

Figure 2.7: Example of Intra-model and Inter-model Construct Equivalences

Figure 2.8: Conceptual Flow of Schema Translation Based on Intra-model and Inter-model Construct Equivalences

As shown in Figure 2.8, the source convergence stage utilizes the intra-model construct equivalences of the source data model, the source-target projection stage employs the inter-model construct equivalences between the source and target data models, and the target enhancement stage manipulates the intra-model construct equivalences of the target data model. A procedural sketch of these three stages follows.
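The three stages can be read as a pipeline over the application schema. The Python sketch below is illustrative only: the equivalence-application routine and its inputs are hypothetical stand-ins for the reasoning method developed later in the dissertation, and the target enhancement stage is simplified to applying the target's intra-model equivalences toward the non-overlapped constructs whenever such an equivalence exists.

    # Rewrite instances of constructs outside 'allowed' using directed equivalences.
    def apply_equivalences(schema, equivalences, allowed):
        # schema: list of (construct, instance); equivalences: list of (lhs_constructs, rewrite_fn)
        result = []
        for construct, instance in schema:
            if construct in allowed:
                result.append((construct, instance))
                continue
            for lhs, rewrite in equivalences:
                if construct in lhs:
                    result.extend(rewrite(instance))      # equivalent instances of allowed constructs
                    break
            else:
                result.append((construct, instance))      # no equivalence found: potential semantic loss
        return result

    def translate(schema_in_S, intra_S, inter_ST, intra_T, overlap_S, overlap_T, nonoverlap_T):
        a2 = apply_equivalences(schema_in_S, intra_S, allowed=overlap_S)   # 1. source convergence
        a3 = apply_equivalences(a2, inter_ST, allowed=overlap_T)           # 2. source-target projection
        a4 = apply_equivalences(a3, intra_T, allowed=nonoverlap_T)         # 3. target enhancement (simplified)
        return a4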
In terms of the evolution of the application schema in this schema translation process, the source convergence stage transforms the source application schema A1 into an intermediate application schema A2 which is composed of only the model constructs of S in the overlapped semantic space; the source-target projection stage translates A2 into another intermediate application schema A3 containing only the model constructs of T in the overlapped semantic space; and, finally, the target enhancement stage generates the target application schema A4, which enhances the semantic level of A3 by incorporating model constructs in the non-overlapped semantic space of T. Compared with Figure 2.7, the flow of schema translation from S to T shown in Figure 2.8 follows the right-directed arrows of both the intra-model and inter-model construct equivalences. If the translation direction were from T to S, the flow of schema translation would follow the left-directed arrows in Figure 2.7.

2.3.4 Summary of Research Foundation

This subsection summarizes the research foundation discussed above and identifies the essential components of the construct-equivalence-based methodology for schema translation and schema normalization.
1. The translation knowledge is represented as inter-model construct equivalences between two data models and intra-model construct equivalences of the two data models.
2. The construct-equivalence-based schema translation is bi-directional. As illustrated in Figure 2.7, the direction in which the inter-model and intra-model construct equivalences are applied is determined by the translation direction of a particular translation task. Given a method for transforming the direction of construct equivalences (inter-model and intra-model), the construct-equivalence-based schema translation approach becomes bi-directional.
3. The notion of construct equivalence empowers an integrated methodology for both model (and schema) translation and schema integration. Compared with construct-equivalence-based schema translation, the construct-equivalence-based schema normalization process is the same as the source convergence or target enhancement stage in its ability to transform an application schema into another application schema (represented in the same data model as the original) via intra-model construct equivalences. The normalization criteria required by the schema normalization process are used to direct the transformation process. Although normalization criteria are not required by the source convergence and target enhancement stages of the schema translation process, the former is directed toward the model constructs in the overlapped semantic space of the source data model while the latter is directed toward the model constructs in the non-overlapped semantic space of the target data model. Thus, the reasoning procedures on intra-model construct equivalences for these three processes should be highly similar, if not exactly alike. The methodology based on the construct equivalence concept (called the construct-equivalence-based methodology), including a representation for and reasoning procedures on inter-model and intra-model construct equivalences, can be applied to both model/schema translation and schema normalization.
4. The construct-equivalence-based methodology for schema translation and schema normalization requires a metamodel and its associated metamodeling process.
Inter-model and intra-model construct equivalences are defined in terms of the model constructs included in the semantic spaces of data models. To facilitate the specification of the inter-model and intra-model construct equivalences required by both construct-equivalence-based schema translation and construct-equivalence-based schema normalization, the semantic spaces need to be formally specified. Therefore, a metamodel and its associated metamodeling process are essential components of the construct-equivalence-based methodology.

2.4 Research Framework and Tasks

The components essential to the development of the construct-equivalence-based methodology for schema translation and schema normalization in a large-scale MDBS environment have been identified in Section 2.3.4. They include 1) a metamodel as the formal representation of data models (or their semantic spaces), 2) a metamodeling process, 3) a construct equivalence representation for inter-model and intra-model construct equivalences, 4) a construct equivalence transformation method which transforms the direction of inter-model and intra-model construct equivalences to conform to a desired translation direction, and 5) a construct equivalence reasoning method for schema translation and schema normalization. The metamodel and the metamodeling process are used to formally specify data models and to generate model schemas for the data models. Based on the model schemas of two data models and the construct equivalence representation, the inter-model and intra-model construct equivalences required by construct-equivalence-based model (and schema) translation and/or schema normalization can be specified. The construct equivalence transformation and reasoning methods are the processing components of construct-equivalence-based schema translation and schema normalization. The metamodeling process is knowledge-intensive and usually error-prone, similar to a modeling process for formally specifying an application schema using a data model. Therefore, an inductive metamodeling process which learns the model schema for a data model from example application schemas without interacting with users is critical to the success of the construct-equivalence-based methodology. In the MDBS context, LDBSs and hence local schemas pre-exist and can readily be used by the inductive metamodeling process. Figure 2.9 depicts the architectural framework for the construct-equivalence-based methodology for schema translation and schema normalization. The inductive metamodeling component induces the model schema for a data model from example application schemas represented in that data model. Construct-equivalence-based schema translation or schema normalization requires both the construct equivalence transformation and reasoning methods; the two tasks differ only with regard to the construct equivalences and normalization criteria required. Construct-equivalence-based schema translation translates a source application schema represented in one data model into a target application schema represented in another data model based on the inter-model construct equivalences between, and the intra-model construct equivalences of, the two data models. In contrast, construct-equivalence-based schema normalization, which requires only the intra-model construct equivalences of one data model and normalization criteria, transforms an application schema represented in one data model into a normalized application schema represented in the same data model.
Figure 2.9: Architectural Framework for the Construct-Equivalence-Based Methodology for Schema Translation and Schema Normalization

2.5 Research Questions

Specific research questions that need to be addressed for each component of the construct-equivalence-based methodology for schema translation and schema normalization are listed below.

Research Questions Related to the Metamodel:
1. What are the requirements of a metamodel? Specifically, can any existing data model be adopted as a metamodel? What are the components of a metamodel? What are the relationships among the components of a metamodel?

Research Questions Related to the Metamodeling Process:
2. A metamodeling process is one level higher than a modeling process. Is there any process higher than the metamodeling process? If so, is it required in the context of schema translation and schema normalization?
3. How does the inductive metamodeling process differ from other inductive learning problems? Can any existing inductive learning technique be adopted by the inductive metamodeling process?
4. How efficient and effective is the inductive metamodeling technique developed in this dissertation research?

Research Questions Related to the Construct Equivalence Representation:
5. What are the requirements for the representation of construct equivalences? Can a construct equivalence representation be declarative and suitable for both inter-model and intra-model construct equivalences?
6. What is the execution semantics of an inter-model construct equivalence? Is it the same as the execution semantics of an intra-model construct equivalence?

Research Questions Related to the Construct Equivalence Transformation Method:
7. What are the requirements for the construct equivalence transformation method?

Research Questions Related to the Construct Equivalence Reasoning Method:
8. Related to research question 6, is there any difference between the construct-equivalence reasoning method for inter-model construct equivalences and that for intra-model construct equivalences?
9. Are there any differences among the construct-equivalence reasoning methods for the source convergence stage, the target enhancement stage, and schema normalization?
10. What are the other advantages of construct-equivalence-based model (and schema) translation?

CHAPTER 3
Metamodel Development

This chapter details the metamodeling paradigm extended from the modeling paradigm, the requirements for metamodel development, and a metamodel named SOOER (Synthesized Object-Oriented Entity-Relationship Model). The model constructs of SOOER, a model definition language based on SOOER, and an instantiation mechanism between a metamodel and models will also be discussed.

3.1 Modeling Paradigm and Beyond

3.1.1 Modeling Paradigm

In the modeling paradigm, shown in Figure 3.1, the modeling process takes a model as input to produce a formalized application schema.
Thus, an application schema is an instantiation of the constructs supported by the model and should therefore be constrained by the constraints of the model's constructs. However, a model in the modeling paradigm lacks a formal model formalism, which results in several problems [vBtH91]. First, ambiguity may arise. Different analysts may have different interpretations of, or knowledge about, the meaning and constraints of the constructs in the model, increasing the possibility of erroneous application schemas. Second, since the constraints of the model's constructs are not explicitly formalized, the verification of an application schema represented in the model is usually performed in an ad-hoc manner and is impossible on a formal basis. Finally, comparison or interoperation with other models is difficult, if not impossible.

Figure 3.1: Modeling Paradigm

3.1.2 Components of a Model

To solve the problems pertaining to the modeling paradigm discussed above, a formal specification of a model is needed before embarking on the modeling process. Formally specifying a model requires an understanding of its components. A model basically consists of the following components:

1. Model constructs: Model constructs are a set of high-level building blocks for defining real-world systems. Based on different abstraction views (an abstraction is a mental process used to select some characteristics and properties of a set of objects and to exclude other characteristics that are not relevant [BCN91]), the constructs of a model (called model constructs) can be classified into three types:
• Structural constructs: real-world systems are viewed as a set of data, data properties, and data relationships.
• Behavioral constructs: real-world systems are viewed as the behavior of data.
• Constraint constructs: real-world systems are viewed as a set of logical restrictions on the data existing below the application schema level. Constraints specify the data that are considered permissible [QW86, SK86, IO91]. Constraints in an application schema can be classified into two types: implicit and explicit constraints [B78, LS83, UL93, EN94]. Implicit constraints are implied by the semantics of model constructs when model constructs are instantiated into an application schema. For example, in a relational model, specifying that an attribute is the key of a relation implies a constraint on the attribute's values: a unique value for each tuple. However, implicit constraints are not capable of capturing all the constraints that may occur in an application domain. As a result, additional constraints, called explicit constraints, need to be specified explicitly in an application schema. The constraint constructs of a model are not intended to represent implicit constraints; rather, they are used to represent only explicit constraints.

2. Model constraints: Every model has a set of built-in constraints (called model constraints) associated with its model constructs. Analogous to the constraints in an application schema, which specify the permissible instantiations from an application schema to data, model constraints are used to verify whether an application schema is permissible. For example, in a relational data model there exists a model constraint demanding that every relation have a unique name. Accordingly, duplicate relation names are not allowed in an application schema.
3. Model semantics: Model semantics ascertain the semantics of the model constructs (e.g., the meaning of a key), which will be instantiated as implicit constraints of application schemas.

3.1.3 Metamodeling Paradigm

The need to formally specify a model extends the modeling paradigm into the metamodeling paradigm, as shown in Figure 3.2. In the metamodeling paradigm, a metamodel which provides metamodel constructs is employed to formally define the specification of a model (i.e., the specification of its three components: model constructs, model constraints, and model semantics). The conceptualization process of a model is called metamodeling, while the formal specification of a model is called the model schema, which serves as the formal specification in the modeling process and provides the basis for representing the application schema.

Figure 3.2: Metamodeling Paradigm

As shown in the rightmost column of Figure 3.2, a two-level schema hierarchy related by an instantiation relationship exists in the metamodeling paradigm: the model schema and application schema levels. Figure 3.3 provides a detailed view of this schema hierarchy. A model schema includes the formal specification of the model constructs, model constraints, and model semantics of a model. As mentioned in the previous subsection, model constraints specify logical restrictions on model constructs, while model semantics define the semantics of model constructs. When model constructs are instantiated from the model schema into an application schema, the instantiations need to be verified against the model constraints. Valid instantiations of model constructs will in turn trigger the instantiation of model semantics into implicit constraints of the application schema. Consisting of both explicit and implicit constraints, application constraints define logical restrictions on application constructs and are used to ensure that every extension of the application schema is valid.

Figure 3.3: Schema Hierarchy in the Metamodeling Paradigm

The advantages of extending the modeling paradigm to the metamodeling paradigm are obvious. Formally defined model constructs, along with their associated constraints and semantics, will reduce, if not eliminate, ambiguity related to the meaning and use of the model constructs. Verification of an application schema of the model can be achieved on a formal basis because the model constraints are formally specified and available in the model schema. Moreover, the comparison and interoperation of different models can also be performed at the model schema level. A minimal sketch of how a formally specified model constraint can be used to verify an application schema is given below.
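The sketch below (Python, illustrative only; representing the application schema as a dictionary is an assumption, not the dissertation's representation) uses the earlier relational-model example: the model constraint requiring unique relation names is checked against a candidate application schema.

    # A model constraint from a model schema, checked against an application schema.
    # Here the constraint "every relation must have a unique name" is coded directly;
    # in the metamodeling paradigm such constraints would be stated declaratively
    # in the model schema and evaluated by a generic verifier.
    def unique_relation_names(application_schema):
        names = [rel["name"] for rel in application_schema["relations"]]
        return len(names) == len(set(names))

    model_constraints = [unique_relation_names]

    def verify(application_schema):
        # Return True only if every model constraint holds for the application schema.
        return all(constraint(application_schema) for constraint in model_constraints)

    schema = {"relations": [{"name": "EMPLOYEE"}, {"name": "DEPARTMENT"}, {"name": "EMPLOYEE"}]}
    print(verify(schema))   # False: the relation name EMPLOYEE is duplicated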
3.1.4 Requirements of Metamodel

An unanswered question in the metamodeling paradigm is "what should be the components of a metamodel?" If a model resulting from the metamodeling process is considered an application of a metamodel, this question becomes identical to "what are the components of a model?", which was answered in Section 3.1.2. Accordingly, a metamodel is required to provide the following metamodel constructs essential to the metamodeling process:
1. Structural constructs, used to model the structural aspect of the three types of model constructs.
2. Behavioral constructs, used to capture the behavior of model constructs.
3. Constraint constructs, used to represent model constraints.
In addition, a metamodel should contain specific metamodel constraints on metamodel constructs and metamodel semantics defining the semantics of metamodel constructs.

3.1.5 Extended Metamodeling Paradigm

As mentioned earlier, the modeling paradigm (Figure 3.1) was extended to the metamodeling paradigm (Figure 3.2) to address the need for a formal specification of models. The same reasoning applies at the metamodel level. In Figure 3.2, the metamodel used in the metamodeling process to formalize a model is not itself formally defined. Without a formal specification of the metamodel, however, the metamodeling process suffers from the same problems as the modeling process in the modeling paradigm. Thus, the metamodel schema should also be formally defined for use in the metamodeling process. The conceptualization process of the metamodel is called meta-metamodeling. If a new meta-metamodel were adopted for the meta-metamodeling process, another process, called meta-meta-metamodeling, would be required to formally specify the meta-metamodel. The process could continue to infinity if there were no way to stop it at some level. It can be stopped by using the same model for a specific level and its next higher level [BF92]. Since the main interest of this dissertation research is at the model level, there is no need to go beyond the metamodel level, at which the process is terminated by using the metamodel as the meta-metamodel. In other words, the metamodel is formally specified by its own constructs. Figure 3.4 depicts this extended metamodeling paradigm, in which the process terminates at the metamodel level.

Figure 3.4: Extended Metamodeling Paradigm

The extended metamodeling paradigm expands the schema hierarchy from two levels to three by adding the metamodel schema. Figure 3.5 depicts the upper two levels of the schema hierarchy. As with the two types of application constraints at the application schema level, model constraints consist of explicit model constraints and implicit model constraints. Implicit model constraints are implied by and derived from the semantics of metamodel constructs when metamodel constructs are instantiated into model constructs in a model schema, while explicit model constraints are those not captured by implicit model constraints but which can be explicitly specified in a model schema. The components and their relationships in the metamodel schema are almost identical to those in a model schema, since the domains of both schemas are models (despite their difference in role, one being the metamodel while the other is being metamodeled). There is an additional instantiation relationship between metamodel semantics and implicit metamodel constraints in the metamodel schema. This instantiation relationship reflects the decision that the terminating process in the extended metamodeling paradigm is the meta-metamodeling process, in which the metamodel is specified by its own constructs. Consequently, the metamodel semantics are instantiated as the implicit metamodel constraints of the metamodel schema.
On the other hand, since the schema at each lower level is an instantiation of that at the next higher level in the schema hierarchy, the relationships between the model schema level and the metamodel schema level (shown in Figure 3.5) are the same as those between the application schema level and the model schema level (shown in Figure 3.3). Thus, the validity of a model schema is governed by the metamodel constraints, while instantiations from metamodel semantics into implicit model constraints are triggered by valid instantiations from metamodel constructs into model constructs.

Figure 3.5: Schema Hierarchy in the Extended Metamodeling Paradigm

3.2 Metamodel: Synthesized Object-Oriented Entity-Relationship Model

The extended entity-relationship (EER) model, the object-oriented (OO) model, and first order logic (FOL) are the prevalent candidates for a metamodel in the literature [SZ93]. The EER model and its variants provide a set of semantically rich structural constructs, but they lack the behavioral and constraint constructs required of a metamodel, as defined in Section 3.1.4. The OO model and its variants are capable of describing both the structural and behavioral aspects of models. However, the semantic richness of their structural constructs is less than that provided by the EER model. Like the EER model, the OO model does not provide a declarative constraint specification capability, so by itself it cannot completely satisfy the requirements of a metamodel. The FOL representation consists of a set of well-formed formulas. The main strength of FOL lies in its declarative constraint specification capability, and it also permits structural modeling. However, representing structural constructs in well-formed formulas has been found difficult [SZ93]. Furthermore, the behavioral aspect of models cannot be represented using FOL. Therefore, FOL alone cannot completely satisfy the requirements of a metamodel either. Although none of the models discussed can alone fully satisfy the requirements of a metamodel, each has its unique strengths in meeting them. An immediate solution is to integrate or synthesize these three models in such a way that the weaknesses of one are complemented by the strengths of the others. The Synthesized Object-Oriented Entity-Relationship (SOOER) model [LW91] represents an effort in this direction: it synthesizes and extends the concepts and notations belonging to the OO and EER families of models to provide the constructs necessary to model the structural, behavioral, semantic constraint, and heuristic knowledge pertaining to coupled knowledge-base/database systems. In the SOOER model, a production rule is employed as the representation of the constraint construct, but it has limited capability to express complicated constraints. In this section, an FOL-based constraint construct for the SOOER model is therefore proposed to replace the rule-based constraint construct.
By integrating the semantically rich structural constructs from the EER model, the behavioral constructs from the OO model, and the declarative constraint constructs from FOL, the SOOER model can satisfy the requirements of a metamodel, making it an appropriate choice for the metamodel. In the following, the constructs of the SOOER model are first described, followed by a detailed discussion of its constraint specifications.

3.2.1 Constructs of the SOOER Model

The constructs of the SOOER model provide the means of representing the properties, relationships, behaviors, and constraints of data of interest. The main structural constructs of the SOOER model are entity classes and relationship classes. Constructs for attributes, methods, and constraints are encapsulated in entity classes and relationship classes. The graphical notations of the structural model constructs adopted by the SOOER model are summarized in Figure 3.6.

Figure 3.6: Graphical Notations of the SOOER Model Constructs

Entity class: An entity class is an abstraction of a group of objects which have common characteristics (attributes), behavior (methods), and relationships with other objects, and which share the same set of semantic constraints. Therefore, the constituents of an entity class include attributes, methods, and constraints. A subset of its attributes is designated as the identifier of the entity class. Furthermore, an entity class may have relationships with other entity classes.

Relationship class: A relationship class is a logical connection between or among entity classes. Three types of relationship classes are supported by the SOOER model: specialization, aggregation, and association. Each type of relationship has its own distinct semantics and is described as follows.

1. Specialization relationship class: A specialization relationship class categorizes a general entity class into one or more specialized entity classes. The general entity class serves as a superclass, and each specialized entity class is a subclass of the general entity class. Specialization relationships are transitive: if A is a subclass of B and B is a subclass of C, then A is a subclass of C. One important mechanism of specialization relationship classes is inheritance, by which a subclass inherits properties (i.e., attributes, methods, constraints, and relationships) from its superclass. A specialization relationship class is characterized by completeness and disjointness properties. The completeness property of a specialization relationship class is total if the set of all objects of the superclass is the same as the union of all objects of its subclasses; otherwise, the specialization relationship is partial. The disjointness property of a specialization relationship is disjoint if there is no overlap between the objects of any two subclasses; otherwise, the specialization relationship class is said to be overlapped.

2. Aggregation relationship class: An aggregation relationship class indicates that a component entity class is a-part-of an assembly entity class.
An aggregation relationship class often possesses an existence dependency between the assembly entity class and its component entity classes; that is, when an object of the assembly entity class is deleted, all of its component objects are also deleted. Another property of an aggregation relationship class, called operation propagation, states that an operation performed on an assembly object will usually be propagated to its component objects. Like specialization relationship classes, aggregation relationship classes are also transitive. The participation of a component entity class in an aggregation relationship class is characterized by its minimal and maximal cardinalities. The minimal cardinality of a component entity class specifies the minimum number of objects of that entity class required to contribute to an object of the assembly entity class, while the maximal cardinality specifies the maximum number of objects of the component entity class that may contribute to an object of the assembly entity class.

3. Association relationship class: An association relationship class can be a unary, binary, or n-ary relationship in which n otherwise independent entity classes are related to one another. Like an entity class, an association relationship class may have its own attributes, methods, and constraints. To distinguish one association relationship class from others, each association relationship class is named uniquely. Each entity class that participates in an association relationship class plays a distinct role in the relationship. If there is no confusion about roles in an association relationship class, role names can be omitted. Furthermore, the participation of each entity class in an association relationship class is again constrained by minimal and maximal cardinalities. The minimal cardinality of an entity class participating in an association relationship class indicates the minimum number of association relationship instances in which an object of the entity class must participate, while the maximal cardinality specifies the maximum number of association relationship instances in which an object of the entity class can participate.

Attribute: An attribute may be either atomic or composite (i.e., decomposable into a set of subattributes which themselves may be either atomic or composite). An attribute can be stored or derived. A stored attribute of an entity class or relationship class (in the following, the generic term "class" is used for both types of classes) explicitly contains value(s) for each object of the class. A derived attribute, which captures inter-relationships caused by functional composition or heuristic knowledge between attribute values, derives its value for a particular object from the value(s) of other attribute(s) of the same or different objects. Unlike the original SOOER model [LW91], derived attributes, rather than an additional rule construct, are used here to model heuristic knowledge. The advantages of using derived attributes for this purpose include uniform treatment of information (stored or derived) and a reduction in the number of model constructs, resulting in lower model complexity. The form of a derived attribute may contain arithmetic operations, IF-THEN rules, or a combination of both. An attribute is characterized by its data type, multiplicity, uniqueness, and null-specification properties, and may have a default value.
On the multiplicity dimension, an attribute may be single-valued (i.e., an object has at most one value for this attribute) or multi-valued (i.e., an object may have more than one value for this attribute). Regarding the uniqueness property of an attribute, if an attribute of a class is specified as "unique", the value of this attribute is unique across all objects of the class; otherwise, duplication is allowed. The nullspecification of an attribute specifies whether a null value can be assumed by this attribute. Moreover, if an attribute or a set of attributes of an entity class whose values (or composites of values) are distinct for each object of the entity class, the attribute or the set of attributes can become the identifier of the entity class. 84 Methods: Methods of a class define the behavior of the class and in turn define the behavior of all objects of the class. Inherited from the object-oriented paradigm, the interface of a method mcludes the name of the method (i.e., signature), parameter(s), and its return type if applicable. A method of a class can be either an object method which is applied to an individual object of the class or a class method which is applied to the class as a whole. Constraints: Constraints define logical restrictions on the constructs mentioned above. Due to the declarative nature and the expressive power of the FOL and its variations, most of the constraint specification languages proposed, including ALICE [U89], CE [M84, M86], ERC-H- [T93], are FOL-based languages. Thus, an FOL-based constraint construct of the SOOER model will be developed and discussed in the next subsection. 3.2.2 Constraint Specification in the SOOER Model Constraints, associated with clzisses, are well-formed first order logic formulae. Extending the FOL [W92], the basic building blocks of constraints are variables, path expressions, and function expressions on variables or path expressions. Upon these basic building blocks, terms, atomic formulae and finally well-formed formulae are specified. These terminologies are defined as follows. Definition: Variable 85 A variable provides a way to reference an object of a class (an entity class or a relationship class), a set of objects of a class, or an attribute. Accordingly, a variable can be one of the following three forms: 1. Instance variable: refers to an object of a class. The class can be a directly-referenced class or a path-referenced class (which will be defined next). An instance variable can be quantified universally (V or all) or existentially (3 or exists). The expression of an instance variable with its domain is expressed as: V instance-variable: class | path-referenced-class or 3 instance-variable: class | path-referenced-class where | denotes "or". The first expression defines a universally quantified instance variable which denotes that each object of a particular class or a path-referencedclass. The second expression defines an existentially quantified instance variable which denotes that there exists an object of a particular class or a path-referencedclass. 2. Instance-set variable: refers to a set of objects of a class as a whole. Again, the domain of an instance-set variable can be a directly-referenced or path-referenced class. An instance-set variable cannot be quantified by an existential or universal quantifier. 
The expression of an instance-set variable with its domain is expressed as: instance-set-variable = class | path-referenced-class which states that the instance-set-variable refers to the set of objects of a particular class or a path-referenced-class. 86 3. Attribute variable: is bound to an attribute value of an object of a class. Howr binding is perfonned depends on whether an attribute variable is quantified universally or existentially. Furthermore, if an attribute variable refers to a composite attribute, the aggregation of its subattributes will be the domain of the attribute variable. Since an attribute can only be visited from an object of a class (which is expressed as an instance variable) followed by a sequence of attribute links with or without traversing through other classes, the attribute referenced by an attribute variable is always a path-referenced-attribute. The expression of an attribute variable with its domain is expressed as: V attribute-variable: path-referenced-attribute or 3 attribute-variable: path-referenced-attribute The first expression defines a universally quantified attribute variable which denotes each value of path-referenced-attribute. An existentially quantified attribute variable, defined in the second expression, expresses that there exists a value of the pathreferenced-attribute. Definition: Path Expression Let t be an instance variable on the class T. A path expression, denoted by t.Ai.A2...An (t is the origin of the path, while An is the terminal of the path) refers to a path in a SOOER schema, satisfying the following statements for each j € {1,..., n}: 1. If Aj is an attribute, then Ak is an attribute of A^.i, for each k e (j+l,..., n}. 2. If Aj is an entity class, then Aj is the end of the path or there exists Aj+, which is either an attribute of Aj or a relationship in which Aj participates. 87 3. If Aj is an association relationship class, then A,- is the end of the path or there exists Aj+i which is either an attribute of Aj or an entity class participating in Aj. 4. If Aj is a specialization or aggregation relationship class, then Aj+i is an entity class participating in Aj. The above statements for a path essentially have the following properties: 1. A path starts from an instance variable. 2. The terminal of a path is an entity class, an association relationship class, or an attribute, but can never be a specialization or an aggregation relationship class. 3. If a path ends at an entity class or an association relationship class, the terminal class is called a path-referenced-class. For each object of T (i.e., an object denoted by t), the path returns a set of objects of the terminal class associated to t via this path. 4. If a path ends at an attribute, the attribute is called the path-referenced-attribute. For each object of T (i.e., an object denoted by t), the path returns a value or a set of values of the terminal attribute associated to t via this path. Whether it returns a single value or a set of values depends on the maximum cardinality of each entity class and the multiplicity of the attribute involved in the path. If all maximal cardinalities involved in the path are 1 and the multiplicity of the attribute is singlevalued, then the path will return a single value; otherwise, it returns a set of values. The path expression t.A,.A2...An is an abbreviated form in which the role name of an entity class participating in its subsequent association relationship class in the path is omitted. 
In the full expression, to express an association relationship class in a path, the preceding and subsequent entity classes of the association relationship should be signified by particular role names indicating how the path traverses through this association relationship class. In other words, if Aj-1.Aj.Aj+1 is part of a path where Aj is an association relationship class and Aj-1 and Aj+1 are entity classes, it is represented as Aj-1.rolej-1→Aj→rolej.Aj+1, where the preceding entity class Aj-1 takes the role rolej-1 and the subsequent entity class Aj+1 takes the role rolej. Role names are not necessary in an association relationship class when all the participating entity classes are distinct.

Definition: Function Expression
Variables, entity classes, association relationship classes, and path expressions (path-referenced-classes or path-referenced-attributes) can be manipulated by functions to derive aggregate information. The most commonly used functions are summarized in Table 3.1.

• min() — input: an atomic path-referenced-attribute — returns the minimum value of a set of attribute values
• max() — input: an atomic path-referenced-attribute — returns the maximum value of a set of attribute values
• sum() — input: an atomic, numerical path-referenced-attribute — returns the summation of a set of attribute values
• avg() — input: an atomic, numerical path-referenced-attribute — returns the average of a set of attribute values
• count() — input: an instance-set variable, path-referenced-class, or path-referenced-attribute — returns the total number of objects (or values) in a set without duplicate elimination
• countdistinct() — input: an instance-set variable, path-referenced-class, or path-referenced-attribute — returns the total number of objects (or values) in a set with duplicate elimination

Table 3.1: Functions for Function Expression in SOOER Constraint Specification

Definition: Term and Atomic Formulae
A term is either a variable, a path expression, a function expression, a constant, or a set of constants. Atomic formulae are constructed by composition of terms as follows:
• A term is an atomic formula.
• If f1 and f2 are atomic formulae, then f1 φ f2 is also an atomic formula, where φ is an arithmetic operator, a value-comparison operator, a set-membership operator (∈, ∉), a set-comparison operator (=, ≠, ⊂, ⊆, ⊃, ⊇), or a set operator (∪, ∩, −).

Definition: Well-Formed Formulae
Well-formed formulae are constructed by composition of atomic formulae as follows:
• An atomic formula is a well-formed formula.
• If f, f1, and f2 are well-formed formulae, then ¬f, f1 ∨ f2, f1 ∧ f2, and f1 ⇒ f2 are well-formed formulae, where ¬ stands for not, ∨ for or, ∧ for and, and ⇒ for implies.

For instance, for a hypothetical Employee entity class participating in a hypothetical Works-On association, ∀e: Employee ⇒ count(e.Works-On) ≥ 1 is a well-formed formula asserting that every Employee object participates in at least one Works-On relationship instance.

3.3 Metamodel Schema and Model Definition Language
As shown in Figure 3.4, the metamodel (i.e., the SOOER model) needs to be formally specified through the meta-metamodeling process to derive the metamodel schema, which serves as the formal specification representation for modeling models (i.e., through the metamodeling process). The meta-metamodeling process is similar to the modeling process except that its target is the metamodel and the formal specification representation is again the metamodel. As depicted in Figure 3.5, the goals of the meta-metamodeling process are to formally define the four components of the metamodel schema: metamodel constructs, explicit metamodel constraints, implicit metamodel constraints, and metamodel semantics.
Therefore, the meta-metamodeling process can be considered as the process of structurally modeling metamodel constructs, specifying explicit metamodel constraints, defining metamodel semantics, and instantiating metamodel semantics into implicit metamodel constraints.

3.3.1 Metamodel Constructs in Metamodel Schema
The metamodel constructs are structurally modeled in the metamodel schema and graphically illustrated by Figure 3.7.

Figure 3.7: Metamodel Constructs in Metamodel Schema

Explanation: As shown in Figure 3.7, the metamodel provides two main metamodel constructs: Entity-Class and Relationship-Class. These two metamodel constructs are described by a name property because, when metamodeling a model (in the metamodeling process), a name needs to be specified for each model construct which is an instance of Entity-Class or Relationship-Class (e.g., the relation model construct in the relational data model is an instance of Entity-Class with the name "Relation"). Since the two metamodel constructs share a common property and common relationships to other metamodel constructs (which will be explained later), a superclass called Class is created in the metamodel schema for these two metamodel constructs. Relationship-Class is further specialized into three metamodel constructs: Association, Specialization, and Aggregation. An association relationship relating two or more possibly identical entity-classes, and an entity-class optionally associating with one or more association relationships, are described by a relationship between the metamodel constructs Association and Entity-Class in the metamodel schema. Each link between an association relationship and an entity-class can be further described by a minimal-cardinality, a maximal-cardinality, and a role taken by the entity-class when participating in the association relationship. On the other hand, the metamodel supports the constructs for modeling attributes, constraints, and methods, as defined in Section 3.2.1. Three metamodel constructs are included in the metamodel schema: Attribute, Constraint, and Method. Since each entity-class or relationship-class may contain attributes, methods, and constraints, an aggregation relationship is created between the metamodel construct Class as its aggregate-class and the metamodel constructs Attribute, Constraint, and Method as its component-classes. The optional inclusion of attributes, constraints, and methods by an entity-class or a relationship-class is described by zero minimal cardinalities on the component-classes of the aggregation relationship. The metamodel construct Attribute is described by name, multiplicity, uniqueness, and null-spec properties. When a model construct which is an instance of Attribute is created in the model schema during the metamodeling process, the name of this model construct, and whether it is single- or multi-valued, unique or duplicable, and null-allowed or null-disallowed, need to be specified for the next level of instantiation (i.e., when this model construct is instantiated into an application schema).
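To make these instantiation relationships concrete, the following is a minimal illustrative sketch, not part of the dissertation's formal machinery, of how the metamodel constructs just described might be represented as data structures during metamodeling. Python is used purely for illustration; the class names MetaEntityClass and MetaAttribute and the application-level table name Patient are hypothetical, while the model construct Relation and its unique, not-null Name attribute follow the relational example used in the text.

# Minimal illustrative sketch of the metamodeling hierarchy:
# metamodel constructs -> model constructs -> application constructs.

class MetaAttribute:
    """An instance of the metamodel construct Attribute, carrying the
    name, multiplicity, uniqueness, and null-spec properties of Section 3.3.1."""
    def __init__(self, name, multiplicity, uniqueness, null_spec):
        self.name = name
        self.multiplicity = multiplicity    # 'single-valued' | 'multi-valued'
        self.uniqueness = uniqueness        # 'unique' | 'not-unique'
        self.null_spec = null_spec          # 'null' | 'not-null'

class MetaEntityClass:
    """An instance of the metamodel construct Entity-Class; it aggregates
    Attribute components (Constraint and Method are omitted in this sketch)."""
    def __init__(self, name, attributes=(), identifier=()):
        self.name = name
        self.attributes = list(attributes)
        self.identifier = list(identifier)  # names of identifier attributes

# Metamodeling step: the relational model construct "Relation" is specified
# as an instance of Entity-Class with a unique, not-null, single-valued name.
relation_construct = MetaEntityClass(
    name="Relation",
    attributes=[MetaAttribute("Name", "single-valued", "unique", "not-null")],
    identifier=["Name"],
)

# Next-level instantiation: an application construct (a concrete relation in an
# application schema) is an instance of the model construct "Relation".
patient_table = {"construct": relation_construct.name, "Name": "Patient"}

The same two-step pattern, a model construct instantiated from a metamodel construct and application constructs instantiated from that model construct, is what the explicit metamodel constraints of the next subsection are meant to verify.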
3.3.2 Explicit Metamodel Constraints in Metamodel Schema
Since explicit metamodel constraints define logical restrictions on metamodel constructs and will be used to verify instantiations of metamodel constructs into a model schema, as shown in Figure 3.5, explicit metamodel constraints generally take the form:
∀instance-variable: metamodel-construct, ... ⇒ assertion on path-referenced-attributes
The general form consists of two parts, an antecedent and an assertion, separated by ⇒. If the antecedent defined on each instance of a particular metamodel construct (i.e., each model construct which is an instance of this metamodel construct) is true, the assertion on an attribute reachable from the instance (i.e., a property reachable from the model construct) must be satisfied. Table 3.2 lists the explicit metamodel constraints of the SOOER metamodel.

Class:
• ∀c: Class, ∀a: c.Identifier.compose.Attribute ⇒ a.Null-Spec = 'not-null' ∧ a.Multiplicity = 'single-valued' (identifier attributes must be not-null and single-valued)
• ∀c: Class, count(c.Identifier) = 1 ⇒ c.Identifier.compose.Attribute ⊆ c.Attribute (inclusion constraint on identifier and attributes)

Relationship-Class:
• ∀r: Relationship-Class ⇒ r.Type ∈ {'Association', 'Aggregation', 'Specialization'} (domain constraint on Type)

Attribute:
• ∀a: Attribute ⇒ a.Uniqueness ∈ {'unique', 'not-unique'} (domain constraint on Uniqueness)
• ∀a: Attribute ⇒ a.Null-Spec ∈ {'null', 'not-null'} (domain constraint on Null-Spec)
• ∀a: Attribute ⇒ a.Multiplicity ∈ {'single-valued', 'multi-valued'} (domain constraint on Multiplicity)

Association:
• ∀a: Association, ∀r: a.relate ⇒ r.Min-Card ≥ 0 (domain constraint on Min-Card)
• ∀a: Association, ∀r: a.relate ⇒ r.Max-Card > 0 ∨ r.Max-Card = 'm' (domain constraint on Max-Card)
• ∀a: Association, ∀r: a.relate, r.Max-Card ≠ 'm' ⇒ r.Min-Card ≤ r.Max-Card (domain constraint on Min-Card and Max-Card)
• ∀a: Association, ∀r1: a.relate, ∀r2: a.relate, r1 ≠ r2 ⇒ r1.role ≠ r2.role (unique role name for each association)

Specialization:
• ∀a: Specialization, ∀s: a.subclass ⇒ s.Min-Card ≥ 0 (domain constraint on Min-Card)
• ∀a: Specialization, ∀s: a.subclass ⇒ s.Max-Card > 0 ∨ s.Max-Card = 'm' (domain constraint on Max-Card)
• ∀a: Specialization, ∀s: a.subclass, s.Max-Card ≠ 'm' ⇒ s.Min-Card ≤ s.Max-Card (domain constraint on Min-Card and Max-Card)

Aggregation:
• ∀a: Aggregation, ∀s: a.component ⇒ s.Min-Card ≥ 0 (domain constraint on Min-Card)
• ∀a: Aggregation, ∀s: a.component ⇒ s.Max-Card > 0 ∨ s.Max-Card = 'm' (domain constraint on Max-Card)
• ∀a: Aggregation, ∀s: a.component, s.Max-Card ≠ 'm' ⇒ s.Min-Card ≤ s.Max-Card (domain constraint on Min-Card and Max-Card)

Table 3.2: Explicit Metamodel Constraints in Metamodel Schema

Explanation: In Table 3.2, the first explicit metamodel constraint for the metamodel construct Class defines that for every model construct which is an instance of Class (i.e., every instance of Class, denoted by ∀c: Class) and for each attribute which is part of the identifier of this instance (expressed as ∀a: c.Identifier.compose.Attribute), this attribute must be declared as 'not-null' and 'single-valued' (denoted by a.Null-Spec = 'not-null' ∧ a.Multiplicity = 'single-valued').
The second explicit metamodel constraint in the metamodel construct Class asserts that for every model construct which is an instance of the Class (i.e., every instance of the Class denoted by Vc: Class), if the model construct has an identifier (expressed as count(c.Identifier)=l), then the attributes included in the identifier must be a subset of all attributes of the model construct (i.e., denoted by c.Identify.compose.Attribute c c.Attribute). The first explicit metamodel constraint in the metamodel construct Attribute states that for every instance of the Attribute (i.e., any model construct which is an instance of the Attribute), its uniqueness specification can only be either 'unique' or 'not-unique'. 3.3.3 Metamodel Semantics in Metamodel Schema Metamodel semantics define the semantics of metamodel constructs and will be instantiated as implicit model constraints in a model schema. Thus, metamodel semantics usually follow the form of: Vinstance-variablel: metamodel construct ...=> (Vinstance-variable2: instance-variable1 ...=> assertion on path-referenced-attributes from instance-variable2) 96 In this general form of metamodel semantics, the outer antecedent (i.e., Vinstancevariablel: metamodel construct ...) refers to an instance of a particular metamodel construct (i.e., a model construct which is an instance of the metamodel construct). The inner antecedent (i.e., Vinstance-variable2: instance-variable1...) and its assertion (i.e., "assertion on path-referenced-attributes from instance-variable2") refer to instances of the model construct declared in the outer antecedent (i.e., application constructs which are instances of the model construct). In fact, the semantics of a metamodel construct is defined in the assertion of the inner antecedent. The interpretation of this form is for each instance (model construct) of a particular metamodel construct and for each instance of this model construct, the assertion on the instance of the model construct must be satisfied. Using this expression, the metamodel semantics of the metamodel SOOER are listed in Table 3.3. 97 Metamodel Semantics Metamodel Construct • Vc: Class, Va: c.Attribute, a.Uniqueness = 'unique' => Class (Vol: c, Vo2: c, ol^o2=> ol.a7io2.a) • Vc: Class, Va: c.Attribute, a.Null-Spec = 'not-null' => (Vol: c => count(ol.a) > I) • Vc: Class, Va: c.Attribute, Vb: a.composite-^Has-subattribute->subattribute.Attribute, b.Null-Spec = 'not-null' (Vol: c, Val: ol.a.b count(al) > I) • Vc: Class, Va: c.Attribute, a.Multiplicity='single-valued'=> (Vol: c =:> count(ol.a) £ 1) • Vc: Class, Va: c.Attribute, Vb: a.composite->Has-subattribute-)-subattribute.Attribute, b.Multiplicity = 'single-valued' => (Vol: c, Val: ol.a.b => count(al) < I) Association • Va: Association, Vrl: a.relate, Vr2: a.relate. rI.Role ^ r2.Role, cl := rl.Entity-Class, c2 := r2.Entity-Class => (Vol: cl => count(ol.rl.Role->a->r2.Role.c2) > r2.Min-Card) • Va: Association, Vrl: a.relate, Vr2: a.relate, rl.Role * r2.Role, r2.Max-Card 'm'. cl :=rl.Entity-Class, c2 := r2.Entity-Class =3> (Vol:cl => count(o I .rl .Role->^a->r2.Role.c2) < r2.Max-Card) • Va: Association, Vrl: a.relate, Vr2: a.relate, Vr3: a.relate. 
rl.Role r2.Role, rl.Role r3.Role, r2.Role vt r3.Role, cl := rI.Entity-Class, c2 := r2.Entity-Class, c3 := r3.Entity-Class => (Vol: cl, Vo2: eI.rl.Role->a->r2.Role.c2 => count(o I .rl .Role->a-^r3.Role.c3) > r3.Min-Card) • Va: Association, Vrl: a.relate, Vr2; a.relate, Vr3; a.relate, rl.Role ^ r2.Role, rl.Role ^ r3.Role, r2.Role ^ r3.Role, r2.Max-Card ^'m', cl := rl.Entity-Class, c2 := r2.Entity-Class, c3 := r3.Entity-Class => (Vol: cl, Vo2: el.rl.Role->a->>r2.Role.c2 => count(ol.rl.Role->a->r3.Role.c3) < r3.Max-Card) Table 3.3: Metamodel Semantics in Metamodel Schema Description • Semantics of Uniqueness • Semantics of Null-Spec • Semantics of Null-Spec • Semantics of Multiplicity • Semantics of Multiplicity • Semantics of Min-Card of Unary and Binary Associations • Semantics of Max-Card of Unary and Binary Associations • Semantics of Min-Card of Ternary Associations • Semantics of Max-Card of Ternary Associations 98 Metamodel Metamodel Semantics Construct Speciali­ • Va: Specialization, Ve: a.subciass, s := a.superclass.Entity-Class, c := e.Entity-Class => zation (Vol: s => count(ol.superclass-^a->subclass.c) > e.Min-Card) • Va: Specialization, Ve: a.subclass, e.Max-Card ^'m', s := a.superclass.Entity-Class, c := e.Entity-Class => (Vol: s => count(ol.supercIass-*a->subclass.c) < e.Max-Card) Aggregation • Va: Aggregation, Ve: a.component, s := a.aggregate.Entity-Class, c := e.Entity-Class => (Vol: s => count(ol.aggregate->a->component.c) > e.Min-Card) • Va: Aggregation, Ve: a.component, e.Max-Card 'm', s := a.aggregate.Entity-Class, c := e.Entity-Class => (Vol: s => count(ol.aggregate->a-^component.c) < e.Max-Card) Description • Semantics of Min-Card of subclass in Specialization • Semantics of Max-Card of subclass in Specialization • Semantics of Min-Card of component in Aggregation • Semantics of Max-Card of component in Aggregation Table 3.3 (Continued): Metamodel Semantics in Metamodel Schema Explanation: In Table 3.3, the first metamodel semantics for the metamodel construct Class defines that for each attribute of an instance of the metamodel construct Class (i.e., each attribute of a model construct which is an instance of the Class, expressed as Vc: Class, Va: c.Attribute), if the attribute's uniqueness specification is 'unique' (expressed as a.Uniqueness = 'unique'), then for any two distinct instances of this model construct (i.e., any two distinct application constructs which are the instance of this model construct, depicted as Vol: c, Vo2: c, ol o2), these two instances must have distinct values on this attribute (expressed as ol.a 9^ o2.a). Essentially, this metamodel semantics defines the semantics of the uniqueness specification on attributes. For example, the model construct Relation in the relational data model is specified as an instance of the metamodel construct Entity-Class and, in turn, is also an instance of the metamodel 99 construct Class because Entity-Class is a subclass of Class in the metamodel schema. Further assume that the name attribute of the model construct Relation be specified as unique. According to the assertion of the inner antecedent of the first metamodel semantics for the metamodel construct Class, the name attribute of the model construct Relation being unique delineates that any two distinct instances of the model construct Relation (i.e., any two relations in an application schema) must have distinct names (i.e., distinct relation names in an application schema). 
The specification derived from this metamodel semantics coincides with the intention of specifying the name attribute of the model construct Relation as unique as well as with the constraints imposed on the definition for the relational data model. The second metamodel semantics for the metamodel construct Class defines for each attribute of an instance of the metamodel construct Class (i.e., each attribute of a model construct which is an instance of the Class, expressed as Vc: Class, Va: c.Attribute), if the attribute's null-specification is 'not-null' (expressed as a.Null-Spec = 'not-null'), then every instance of this model construct (each application construct which is the instance of this model construct, depicted as Vol: c) must have at least one value on this attribute (expressed as count(ol.a) >1). As such, this metamodel semantics defines the semantics of attributes being specified as 'not-null'. 100 3.3.4 Implicit Metamodei Constraints in Metamodei Schema The instantiation of metamodei semantics into implicit metamodei constraints basically is a process of instantiating instance variables defined in the outer antecedent of each metamodei semantics in terms of metamodei constructs, evaluating conditions specified in the outer antecedent of each substituted metamodei semantics against metamodei constructs, and adding the inner antecedent and its assertion into metamodei schema as its implicit metamodei constraints when the conditions are evaluated as true. As shown in Figure 3.5, another instantiation relationship exists between metamodei semantics and implicit model constraints of a model schema. The instantiation process just described for deriving implicit metamodei constraints is also valid for deriving implicit model constraints. The detailed metamodei semantics instantiation algorithm is summarized in Algorithm 3.1. 101 GIobal-Instaiitiate(S, M) /* Input; Metamodel semantics S and metamodel schema (or model schema) M Result: Metamodel schema with implicit metamodel constraints (or model schema with implicit constraints) */ Begin For each metamodel construct (or model construct) C in M For each metamodel semantics L of the metamodel construct Class in S C-Instantiate(L, C). If C is an instance of the metamodel construct Association For each metamodel semantics A of the metamodel construct Association in S C-Instantiate(A, C). If C is an instance of the metamodel construct Aggregation For each metamodel semantics G of the metamodel construct Aggregation in S C-Instantiate(G, C). If C is an instance of the metamodel construct Specialization For each metamodel semantics P of the metamodel construct Specialization in S C-Instantiate(P, C). End. C-Instantiate(T, C) /* Input: A metamodel semantics T and a metamodel construct (or model construct) C Result: C with implicit metamodel constraints (or implicit model constraints) */ Begin Initialize Stack; T = Instantiate all occurrences of the first instance-variable defined in the outer antecedent of T by C's name. Push(T, Stack). While Stack is not empty T = Pop(Stack). If T contains an uninstantiated instance-variable I in the outer antecedent For each metamodel construct (or model construct) R which is related to C and is an instance of the domain of I T = Instantiate all occurrences of I in T by R's name. Push(T, Stack). Else If (T contains no condition is the outer antecedent) or (the conditions in the outer antecedent are evaluated as true) Add the inner antecedent and its assertion of T into C. End. 
Algorithm 3.1: Algorithm of Metamodel Semantics Instantiation

Given the metamodel schema shown in Figure 3.7, Algorithm 3.1 generates the implicit metamodel constraints for the metamodel according to the metamodel semantics specified in Table 3.3. For example, since the Name attribute of the metamodel construct Class is its identifier (as shown in Figure 3.7), the uniqueness, null-spec, and multiplicity properties of this attribute are 'unique', 'not-null', and 'single-valued', respectively. Accordingly, the following implicit metamodel constraints for the metamodel construct Class can be instantiated:
1. ∀o1: Class, ∀o2: Class, o1 ≠ o2 ⇒ o1.Name ≠ o2.Name
2. ∀o1: Class ⇒ count(o1.Name) ≥ 1
3. ∀o1: Class ⇒ count(o1.Name) ≤ 1
The first implicit metamodel constraint states that when two instances of the metamodel construct Class are distinct, their names must differ; that is, the names of model constructs in a model schema must be unique. The second and third implicit metamodel constraints together define that each instance of the metamodel construct Class has one and only one name (i.e., a name is required for every model construct in a model schema). Because implicit metamodel constraints are derivable in nature, a complete listing of all implicit metamodel constraints is not provided here.

3.3.5 Model Definition Language
Based on the metamodel constructs and the explicit metamodel constraints, a model definition language (MDL) is developed, as shown in Figure 3.8. Rather than replacing the graphical representation of the SOOER metamodel, the MDL provides an alternative way of textually describing a model schema.

MODEL: model-name {
  < ENTITY-CLASS: entity-name {
      [ ATTRIBUTE: attribute-declaration
        IDENTIFIER: attribute-name [ <, attribute-name> ] ]
      [ METHOD: <method-name (declaration): implementation> ]
      [ CONSTRAINT: <constraint-name: specification> ] } >
  < RELATIONSHIP-CLASS: relationship-name {
      TYPE: type-declaration
      [ ATTRIBUTE: attribute-declaration
        [ IDENTIFIER: attribute-name [ <, attribute-name> ] ] ]
      [ METHOD: <method-name (declaration): implementation> ]
      [ CONSTRAINT: <constraint-name: specification> ] } > }

attribute-declaration := <att-name (DATA-TYPE: data-type-specification
                                    UNIQUENESS: uniqueness-specification
                                    NULL-SPEC: null-specification
                                    MULTIPLICITY: multiplicity-specification
                                    [ HAS-SUBATTRIBUTE: attribute-declaration ]
                                    [ DERIVED-FROM: attribute-name ([ <, attribute-name> ]) derivation-rule ])>

type-declaration := ASSOCIATION
                      ASSOCIATED-ENTITY-CLASS: <entity-name (ROLE: role-name,
                          MIN-CARDINALITY: cardinality-specification,
                          MAX-CARDINALITY: cardinality-specification)> |
                    SPECIALIZATION
                      SUPERCLASS: entity-name
                      SUBCLASS: <entity-name (MIN-CARDINALITY: cardinality-specification,
                          MAX-CARDINALITY: cardinality-specification)> |
                    AGGREGATION
                      AGGREGATE: entity-name
                      COMPONENT: <entity-name (MIN-CARDINALITY: cardinality-specification,
                          MAX-CARDINALITY: cardinality-specification)>

uniqueness-specification := unique | not-unique
null-specification := null | not-null
multiplicity-specification := single-valued | multi-valued
cardinality-specification := 0 | 1 | ... | m

Annotations: < > denotes that the enclosed item repeats one or more times; [ ] denotes that the enclosed item is optional; A | B denotes A or B.

Figure 3.8: Model Definition Language Based on the SOOER Metamodel
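As an illustration of the MDL syntax of Figure 3.8, the following sketch shows how a small fragment of a model schema might be written textually. The fragment is hypothetical: the model name Simple-Relational, the role names owner and member, the data type string, and the cardinality choices are assumptions made only for this example and are not taken from Appendix B.

MODEL: Simple-Relational {
  ENTITY-CLASS: Relation {
    ATTRIBUTE: Name (DATA-TYPE: string
                     UNIQUENESS: unique
                     NULL-SPEC: not-null
                     MULTIPLICITY: single-valued)
    IDENTIFIER: Name
  }
  ENTITY-CLASS: Attribute {
    ATTRIBUTE: Name (DATA-TYPE: string
                     UNIQUENESS: not-unique
                     NULL-SPEC: not-null
                     MULTIPLICITY: single-valued)
    IDENTIFIER: Name
  }
  RELATIONSHIP-CLASS: Consist-Of {
    TYPE: ASSOCIATION
          ASSOCIATED-ENTITY-CLASS: Relation (ROLE: owner,
                                             MIN-CARDINALITY: 1,
                                             MAX-CARDINALITY: m)
          ASSOCIATED-ENTITY-CLASS: Attribute (ROLE: member,
                                              MIN-CARDINALITY: 1,
                                              MAX-CARDINALITY: 1)
  }
}

Additional attributes, methods, and constraints could be attached to each class using the corresponding MDL clauses.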
3.4 Metamodeling Process and Model Schema
Using the metamodel and its metamodel schema defined previously, the metamodeling process is the process of formalizing a model as a model schema, which consists of four components: model constructs, explicit model constraints, implicit model constraints, and model semantics. Appendix B lists the model schema resulting from metamodeling a relational data model. Its model constructs are graphically depicted in Figure 3.9.

Figure 3.9: Model Schema of Relational Data Model

The metamodeling process is similar to a modeling process through which an application schema is formally specified. As mentioned in Section 2.4, a large-scale MDBS environment requires support for an inductive metamodeling process which learns the model schema of a data model from some example application schemas without interacting with users. The development of an inductive metamodeling technique (called the Abstraction Induction Technique) is discussed in the next chapter.

CHAPTER 4
Inductive Metamodeling

This chapter first analyzes the characteristics of the inductive metamodeling problem, reviews existing inductive learning techniques based on a proposed analysis framework, and investigates the applicability of existing inductive learning techniques to inductive metamodeling. Subsequently, the development of a technique for inductive metamodeling, the Abstraction Induction Technique, is described in detail. An evaluation of this technique is also performed and analyzed.

4.1 Overview
4.1.1 Characteristics of Inductive Metamodeling Problem
The inductive metamodeling process learns the model schema of a data model from some example application schemas (called training examples) represented by this data model. More specifically, the problem of inductive metamodeling can be characterized along the following aspects: learning strategy, type of induction, noises in training examples, representation, and specific induction output. Table 4.1 summarizes each of these aspects, which are depicted in the following.
Inductive learning: Inductive learning can be subdivided into learning from examples and learning by observation and discovery. In learning from examples, the learner induces a general concept description that describes all positive examples and none of the negative examples. In learning by observation and discovery, the learner 107 investigates a domain in an unguided fashion for regularities and general rules explaining all, or at least most, observations. Among these learning strategies, the inductive metamodeling process adopts the inductive learning strategy. Specifically, since all training examples provided are positive examples, the inductive metamodeling process belongs to the category of learning from examples with only positive examples. Type of Induction: The nature of this induction task is to generate a higher-level abstraction which generalizes the structures and constraints exhibited in the training examples. In other words, the inductive metamodeling is a process of structure and constraint generalization. The induced abstraction (i.e., model schema) explains the model constructs with which training examples were defined and the model constraints to which training examples conform. Inversely, a training example is a specific instance of the induced model schema. Noises in Training Examples: Training examples used in the inductive metamodeling process are noise-free since training examples (application schemas) are extracted from or provided by pre-existing LDBSs. Representation: 108 The representation employed by the inductive metamodeling is the language employed for expressing training examples and induction output. Two levels of the representation of training examples need to be distinguished: external and internal representation. The external representation of training examples refers to the original format of training examples. A training example can be extracted from the data dictionary (DD) of a database system or formulated in a data definition language (DDL). On the other hand, the internal representation of training examples is the representation which can be efficiently and effectively manipulated by the inductive metamodeling process. If the internal representation differs from the external representation, transformation of training examples is required. The internal representation of training examples developed for the inductive metamodeling process will be defined in Section 4.2.1. Since the induction output from the inductive metamodeling is the model schema of a data model, the representation of induction output is a metamodel. Specifically, the SOGER model is adopted as the representation of induction output. Specific Induction Output: As mentioned, the induction output from the inductive metamodeling process is a model schema. A model schema consists of the specifications of model constructs, model constraints (including explicit and implicit model constraints), and model semantics. However, it may not be possible for all of the components of a model schema to be induced from application schemas. Instantiation relationships in the schema hierarchy of the (extended) metamodeling paradigm, as shown in Figure 3.3 and Figure 3.5, indicate that application constructs and explicit constraints of an application schema are 109 instantiations of model constructs. Thus, model constructs can be learned from applications constructs and explicit constraints described in training examples. 
Ideally, model semantics can be induced from implicit constraints of an application schema. However, since implicit constraints are application independent, they are usually embedded in DBMSs and are not specified in application schemas. Hence, the induction of model semantics in a model schema solely from application schemas is extremely difficult, if not impossible. On the other hand, since instantiations from model constructs into application constructs and explicit constraints are verified by model constraints, some patterns exhibited in an application schema may be generalized into model constraints. Therefore, some model constraints can be induced from application schemas. In all, only the model constructs and model constraints of a model schema can possibly be induced from application schemas. 4.1.2 Comparisons with Existing Inductive Learning Techniques As mentioned earlier, the inductive metamodeling process employs the learning from examples strategy which is a special type of inductive learning strategy. Therefore, it is necessary to investigate the differences between the inductive metamodeling process and other inductive learning problems as well as the applicability of existing inductive learning techniques to the inductive metamodeling process. The analysis firamework for inductive learning techniques consists of three aspects: the description of training examples, the type of concept induced, and whether the process is 110 constructive or not. The description of training examples can be either structural or attribute-oriented. The types of concept induced from inductive learning generally uiclude characteristic, discriminant and taxonomic. An inductive learning technique is constructive if its induction process changes the description space; that is, it produces new descriptions that were not present in the training examples. These three aspects are described in more detail as follows. Description of Training Examples [DM83]: 1. Structural description: Structural descriptions portray training examples as composite structures consisting of various components. For example, a structural description of a building could represent the building in terms of the floors, the walls, the roof, etc. 2. Attribute description: Training examples are described by a set of properties rather than their internal structures. The attribute description of a building might list its cost, total square-foot, condition, etc. Tvpe of Concept Induced [DM83, C90]: 1. Characteristic: A characteristic concept is a description of a class of training examples that states facts true for all training examples in the class. Thus, characteristic concepts do not necessary comply with the strict discrimination criterion. Characteristic concepts are often encoded as frames or logical formulae. 2. Discriminant: A discriminant concept is a description of a class of training examples that states only those properties of all training examples in the class that are necessary to distinguish them from the training examples in other classes. Often Ill discriminant concepts are encoded as paths from the root to the leaves of an incrementally acquired decision tree. 3. Taxonomic: A taxonomic concept is a description of a class of training examples that subdivides the class into subclasses. An important kind of taxonomic concept is a description that determines a concept clustering. 
Generally speaking, determination of characteristics and discriminant concepts is the subject of learning from (preclassified) examples, while determination of taxonomic concepts is the subject of learning by observation or discovery. Constructive Induction [DM83, P94, K94]: To overcome a situation in which the initially given description space of training examples yields poor learning results, the basic idea of constructive induction is to somehow transform the original description space into a space where training examples exhibit (more) regularity. Usually this is done by introducing new descriptions through combining or aggregating existing descriptions (original or constructed). The new descriptions represent concepts wdth higher-level abstraction than the descriptions from which the new descriptions were derived. Table 4.2 summarizes some existing induction learning techniques in the analysis framework depicted above. Any cell marked by "—denotes not available or not found. Most inductive learning techniques deal with attribute-oriented training examples. Only a few inductive learning techniques concern structural description of training examples. 112 All structural inductive learning techniques induce characteristic concepts, and none of them is constructive. Type of Concept Induced Characteristic Description of Training Examples Structural Attribute-Oriented Non-constructive Constructive Non-Constructive Constructive Winston's System AQ7Uni [S78] DB Learn [CCH89] [W75], Version Spaces [M77, M78] DBLeam AQ15[MMH86], [CCH90], ID3 [Q83], CN2 AQ17-DCI [CN89], C4.5 [Q93] [BM9I], AQI7HCI [WM94], CN2-MCI [K94] EPAM [FS84], COBWEB [F87], CLASSIT [GLF90] — Discriminant • Taxonomic — — Table 4.2: Summary of Existing Inductive Learning Techniques Training examples for the inductive metamodeling process are described by their internal structures (e.g., a patient table consists of the name, patient-id and address attributes) rather than by a set of properties. In terms of type of concept induced, the inductive metamodeling process aims at inducing a model schema which is a characteristic rather than a discriminant or taxonomic concept. Furthermore, training examples for the inductive metamodeling process are instances of model constructs. Since its goal is to induce concepts (model constructs and model constraints) at the model level rather than at the application level, the original description spaces need to be aggregated and generalized from application specific spaces to model specific spaces (e.g., from such 113 specific tables as "patient", "doctor", etc. to a model construct called "table"). Thus, the inductive metamodeiing process needs to be constructive. analysis framework, In sum, in terms of the the technique for inductive metamodeiing needs to deal with structural descriptions of training examples for inducing characteristic concepts in a constructive maimer. However, as shown in Table 4.2, none of the existing inductive learning techniques can be adopted as the technique for inductive metamodeiing. Thus, the development of a new technique is inevitable and will be described in the next section. 4.2 Abstraction Induction Technique for Inductive Metamodeiing To deal with the structural descriptions of training examples (i.e., application schemas) in the inductive metamodeiing process, each training example needs to be decomposed to reflect its internal structure. 
Since inductive metamodeiing is a constructive induction process, training examples at the application level needs to be abstracted into concepts (i.e., model constructs) at the model level. Subsequently, the remaining induction output (i.e., model constraints) will be generalized or generated. Based on this conceptual flow of the inductive metamodeiing process, the technique for inductive metamodeiing (called Abstraction Induction Technique) consists of three phases: concept decomposition, concept generalization, and constraint generation. discussions of each of the phases. The following are the detailed 114 4.2.1 Concept Decomposition Phase This phase is to represent the internal structure of each training example so as to perform the subsequent phases with effectiveness and efficiency. The internal representation for training examples needs to reflect the hierarchical decomposition structure embedded in each training example and to describe the relationships within or between hierarchical decomposition structures. Accordingly, a representation called concept graph is developed. Definitions for the concept graph and its related terminologies are given as follows. Definition: Concept Graph A concept graph, a structural representation of a training example, consists of nodes and directed links. A node represents a term in a training example. A directed links denotes relationship between nodes and can be classified into two types: has-a and refer-to link. A has-a link denotes that its destination node (called constituent node) is a part or a property of its origin node (called composite node), while a refer-to link expresses that its origin node refers to a concept of its destination node defined in the same or different concept graph. A concept node^ refers to a non-leaf node or a leaf node from which some refer-to link(s) is originated. A leaf node which is not associated with any refer-to link is called a property node of its composite node. Definition: Concept Hierarchy " The definition of a concept node is valid for typed languages but may not be applicable to typeless languages. Since all database systems and their underlying data models, to the author's knowledge, are typed systems, this assumption does not result in any loss of the generality of the concept graph representation. 115 A concept hierarchy is a subgraph of a concept graph and consists of only nodes and hasa links. The height of a node in a concept hierarchy is the inclusive number of nodes between the node and its root node. Definition: Spurious Refer-to Links If a node C refers to a node N and to a direct or indu-ect composite node S of N simultaneously, the refer-to link from C to S is called a spurious refer-to link because the semantics of this refer-to link are implied by the refer-to link from C to N. Definition: Derivable Refer-to Links and Consequential Refer-to Links If a node C refers to a set of nodes S which in turn are referenced by another node N where N refers to no other node but S, the refer-to links from C to each node in S are called the derivable refer-to links because, by adding a refer-to link from C to N, they can be derived from the refer-to link from C to N and the refer-to links from N to S. The refer-to link from C to N is called the consequential refer-to link of its corresponding derivable refer-to links. 
Definition: Redundant Refer-to Links If a set of derivable refer-to links and their consequential refer-to link coexist, the set of derivable refer-to links are called redundant refer-to links because they are implied by the consequential refer-to link. Definition: Well-structured Concept Graph 116 A well-structured concept graph is a concept graph without spurious, derivable or redundant refer-to links. gx;amplg: Assume that an example of the relational application schema be: Create Table Ti (Ai char(15) not-null, Aa char(20) not-null, A3 Int null, Primary-Key (Ai, A2)) The corresponding concept graph for this example is shown in Figure 4.1. In this example, T, node has four constituent nodes; Aj, A2, A3, and Primary-Key. Tj, A,, Aj, A3, and Primary-Key are concept nodes because they are either non-leaf nodes or leaf nodes from which some refer-to link(s) is originated. Aj node has two constituent nodes: char(15) and not-null. char(15) node is a leaf node without any refer-to link originating from it and therefore is a property node of Aj. The node Primary-Key is associated with two refer-to links: one to Aj and the other to A2. These two refer-to links correspond to the primary key specification in the example (i.e., Primary-Key A, A2). Furthermore, this concept graph is well-structured because it contains neither spurious, derivable nor redundant refer-to links. 117 Primary-Key cliar(15) not-null cliar(20) not-null Int null ; has-a : refer-lo Figxire 4.1; Example of Concept Graph The goal of the concept decomposition phase is to construct a well-structured concept graph for each training example. As shown in Figure 4.2, the concept decomposition phase can be decomposed into four steps: 1) concept hierarchy creation which creates a concept hierarchy for each training example, 2) concept hierarchy enhancement which transforms each concept hierarchy into a concept graph by adding appropriate refer-to links, 3) concept graph merging which merges each single-node concept graph with another related concept graph, and 4) concept graph pruning which transforms each concept graph into a well-structured one. 118 Application Schemas (Examples of a Data Model) Stopping Words and Keywords (Data Model Dependent) Concept Hierarchy Creation Concept Hierarchies Concept Hierarchy Enhancement Concept Graphs Concept Graph Merging Concept Graph Pruning Well-structured Concept Graphs Figure 4.2: Steps of Concept Decomposition Phase 4.2.1.1 Concept Hierarchy Creation This step is to create a concept hierarchy for each training example. The complexity of this step depends on the extemal representation of training examples. If it is a structured format (e.g., training examples are extracted from the DD), creation of concept hierarchies is straightforward because no parsing is required. However, if the extemal representation of training examples is a DDL format, the prerequisite of the concept hierarchy creation is to parse these unstructured, textual training examples. When training examples are represented in a DDL format, the correctness of the concept hierarchy creation depends on the effectiveness of the parsing algorithm. Stopping words (e.g.. Create) need not be included in concept hierarchies. Keywords (e.g.. 
Table) 119 need to be detached from training examples at the same time being associated with concept hierarchies because these associations indicate generalization possibilities (e.g., if Ti and T2 are prefixed by the keyword Table, they can be considered as instances of the model construct "Table"). Separators (e.g., "(" before A,, after not-null, etc.) which implies the internal structure exhibited in training examples should be preserved to avoid undesirable flatten concept hierarchies. Stopping words and keywords are data model dependent. For example. Table is a keyword in the relational data model, but it may be a non-keyword concept in other data models. Thus, one important domain knowledge required by the concept decomposition step, as shown in Figure 4.2, is the data model dependent stopping words and keywords. Although different data models may use different sets of separators, a set of common separators can be identified and their implication to the creation of concept hierarchies may be invariant among data models. For example, regardless data models, a comma which divides two terms in a training example usually indicates that they are siblings in a concept hierarchy. A set of common separators and their implication to concept hierarchy creation are analyzed and listed in Appendix C. Use of the domain knowledge (data model dependent stopping words and keywords) and the heuristics for separators, the parsing algorithm identifies non-stopping-word, non-keyword and non-separator terms (called regular terms) from training examples. Accordingly, for each training example the concept hierarchy creation step applies the following rules; Creation Rule 1 (C-Rule 1): 120 If a regular term C is the first regular term of the training example, create a new concept hierarchy with the root node named C for the example. C-Rule 2: If a regular term C in the training example is a constituent concept of another regular term P, create a node (named C) for C and add a has-a link from the node P to the node C in the concept hierarchy corresponding to the example. C-Ruie 3: If the keyword K in the training example is not immediately followed by a regtilar term, then 1) create a node for K by applying C-Rule 1 or C-Rule 2, and 2) set the possible-generalization-name of the node K. as K. C-Rule 4: If a regular term C in the training example is a sibling concept of another regular term S and the node S is the root of the concept hierarchy representing the example, then 1) create a node with the name C, 2) create a node with a name same as the keyword K prior to S in the example, 3) set the node K as the root node of the concept hierarchy, 4) add has-a links from K to S and from K to C, 5) set the possible-generalization-name of the node K as K, and 6) reset the possible-generalization-name of the node S as un-initialized. C-Rule 5: 121 If a regular term C in the training example is a sibling concept of another regular term S and the node S is not the root node of the concept hierarchy representing the example, then 1) create a node with the name C, 2) add a has-a link from the composite node of S to C, and 3) if the possible-generaiization-name of the node S has been set, set the possiblegeneralization-name of C as that of S. C-Rule 6: If the regular term C whose immediately prior term is a keyword K in the training example, set the possible-generalization-name of the node C as K. 
Example: Assume that three training examples in a relational data model and the keywords for the relational data model be as below: Example 1 Create table Ti (Ai char(15) not null, Aa char(20) not null, A3 Int null, Primary-Key (A,, A2)) Example 2 Create table T2 (Ai Int not null, A2 real null, Bi char(15) null, 02 char(20) null, Primary-Key (Ai)) Example 3 Foreign-Key T2 (B^, 82) reference Ti (Ai, A2) Keywords Table, Primary-Key, Foreign-Key The corresponding concept hierarchy created for each training example is shown in Figxire 4.3. According to C-Rule 6, the possible-generalization-name of the node T, in the first concept hierarchy and of the node T2 in the second concept hierarchy would be 122 set to "Table". As a result of applying C-Rule 3, the possible-generalization-name of the node Primary-Key in the first and the second concept hierarchy is "Primary-Key". When constructing the concept hierarchy for Example 3 before T j is encountered, the root node of the concept hierarchy is T2 because the immediately prior term of T2 is a keyword and thus C-Rule 6 is applied. Thus, the possible-generalization-name of T2 is "Foreign-Key" now. When processing the regular term T1 in this example, T, should be a sibling node of T2. According to C-Rule 4, a new root node "Foreign-Key" whose possible- generalization-name is also "Foreign-Key" is created for this concept hierarchy whose immediate constituent nodes would be T, and T2, and the possible-generalization-name for the node T[ will be un-initialized. 123 Bampfrl Pnmary-Key char(lS) BOt-null not-null cfaar(20) null int Ai ^2 BtampltZ A: Ai zr / \ A int \ \ real A T3 char(15) null not-null Primary-Key Bz B, char(20) null \ null Foreign-Key Bamptea X T2 Ti \ B, B2 / A, Az : has-a Figure 4.3: Concept Hierarchies (Example 1,2 and 3) After Concept Hierarchy Creation 4.2.1.2 Concept Hierarchy Enhancement Concept hierarchy enhancement evolves concept hierarchies into concept graphs by substituting the leaf nodes which imply reference relationships with refer-to links. A reference relationship exists when a leaf node of a concept hierarchy is identical to a non- 124 leaf node (i.e., concept node) in the same or different concept hierarchy or when a leaf node which is qualified by the root node of another concept hierarchy is identical to a leaf node of that concept hierarchy. In the first case, if the global naming scheme (i.e., the names of the concept nodes in all training examples should be unique) is employed in defining these training examples, the names of all non-leaf nodes in all concept hierarchies would be different. As such, there would be no confusion in determining whether the non-leaf node referenced by a leaf-node is in the same or different concept hierarchy. However, the names of non-leaf node concepts which are unique within a training example usually are not unique globally. For example, the names of the attributes of a relational table are unique though different relational tables may have attributes with the same name. To avoid ambiguity resulted from the local naming scheme in defining these examples, inter-example references (i.e., a concept in one training example refers to other concept in another training example) require qualifications while intra-example references need not be qualified. The qualification often begins from the top-most concept of the training example (i.e., the root node of the concept hierarchy) to with the concept being referenced. 
On the other hand, the second case does not contain any ambigtaity because the leaf node which implies a reference relationship has already been qualified by the root node of another concept hierarchy. Referring to Figure 4.3, the concept hierarchy for Example 3 contains two qualification nodes: T2 and Tp Each of these qualification nodes explicitly depicts the concept hierarchy to which their constituent nodes refer. Thus, it is un-ambiguous to say that A, and A2 of Ti refer to the concept hierarchy of Example 1 rather than that of Example 2. 125 As shown in Figure 4.3, the qualification nodes appear as the composite nodes of those leaf nodes been qualified. However, it is also possible that a qualification node appears as a sibling node of the qualified leaf nodes. Therefore, rules for concept hierarchy enhancement need to deal with both situations. Moreover, in the case of inter-example reference, the qualified leaf nodes need to be substituted by refer-to links. That is, referto links are created and the qualified leaf nodes are removed. After qualified leaf-nodes are substituted and removed, their qualification nodes should be removed fi^om the concept hierarchy. The removal of qualification nodes is due to the fact that it is the product of the local naming scheme and its existence depends on the existence of the leaf-nodes been qualified. Based on the local naming scheme, the possible locations of the qualification nodes, and the removal of the qualification nodes as discussed above, the rules for the concept hierarchy enhancement are as follows; For each leaf node in every concept hierarchy. Enhancement Rule 1 (E-Rule 1): If the leaf node L is the same as a non-leaf node N in the same concept hierarchy and N does not appear in any other concept hierarchy, then create a refer-to link fi-om the direct composite node of L to N and remove L from the concept hierarchy. E-Rule 2: If the leaf node L is the same as some non-leaf nodes N in the same and different concept hierarchies, and neither the non-root composite (direct and indirect) nodes of L nor the sibling nodes of L include any root node of the concept hierarchy 126 which contains N, then create a refer-to link from the direct composite node of L to N of the same concept hierarchy as L and remove L from the concept hierarchy. E-Rule 3: If the leaf node L is the same as some (leaf or non-leaf) nodes N in the same and/or different concept hierarchies and Cj, C2, ...Cn (C| is the composite node of L, C2 is the composite node of Cj, is the composite node of Cn.i) are the non-root direct and indirect composite nodes of L where €„ is the same as the root node of a different concept hierarchy H, then create a refer-to link from the durect composite node of €„ to N of H and remove L from the concept hierarchy. Furthermore, after removing L if Ci becomes the leaf node, remove C, as well. This removal process continues from C, to Cj (where i < n) and terminates when C; is a non-leaf node. E-Rule 4: If the leaf node L is the same as some (leaf or non-leaf) nodes N in the same and/or different concept hierarchies, and the non-root composite (direct and indirect) nodes of L do not include any root node of the concept hierarchy which contains N, and one of the sibling nodes S of N is the same as the root node of a different concept hierarchy H which contains N, then create a refer-to link from the direct composite node of L to N of H and remove L from the concept hierarchy. 
Furthermore, after removing L, if S does not have any sibling node, remove S as well. For each leaf node L in every concept hierarchy, E-Rule 1 deals with unambiguous intraexample references and E-Rule 2 is for references without qualification which imply 127 intra-example references. E-Rule 3 is for references with qualification (expressed in the composite nodes of L) which denote inter-example references, while E-Rule 4 is for references with qualification (expressed in the sibling node of L) which also denote interexample references. If none of the rules is applicable, no action need to be taken on L. Example: Referring to Example 1 in Figure 4.3, the leaf node Ai (whose composite node is Primary-Key) is the same as the non-leaf node A, of Tj (in Example 1) and A, of T2 (in Example 2). Since the non-root composite (direct and indirect) nodes of this leaf node (in this case, only Primary-Key) do not include T2 (i.e., the root node of another concept hierarchy where A^ appears also), the reference relationship implied by this leaf node should be an intra-concept hierarchy reference and thus E-Rule 2 is applied. According to E-Rule 2, the leaf node Ai (of Primary Key) of Example 1 is deleted and a refer-to link is added from Primary Key to the non-leaf node Aj in Figure 4.3. The same process can be applied to the leaf node A2 of Primary Key in Example 1. Regarding the remaining leaf nodes of Example 1, they are not the same as any non-leaf node in all concept hierarchies, no action will be taken for these leaf because since they represent the properties of their direct composite nodes. The enhancement process for the concept hierarchy for Example 2 is similar to that for Example 1. The resulting concept graphs enhanced from Figure 4.3 for Example 1 and 2 are shown in Figure 4.4. 128 Primary-Key char(15) cbar(20) not-null int null Pnmary-Key int \ not-null real null char(15) \ null char(20) : bas-« null : refer-to Figure 4.4: Concept Graphs (Example 1 and 2) After Concept Hierarchy Enhancement The concept hierarchy enhancement process continues for Example 3. As shown in Figure 4.3, the leaf node B, is qualified by its composite node (T2) because T, is the root node of the concept hierarchy for Example 2. Thus, E-Rule 3 is triggered. As a result, a refer-to link is added cormecting the composite node of T2 (i.e., Foreign-Key) in Example 3 to the node in Example 2, and Bi of T2 in Example 3 is removed. However, the qualification node T2 will not be deleted since it still has a constituent node (B2). The same process is applied to the leaf node B2 which is qualified by its composite node T2. After the substitution of this leaf node by another refer-to link from the composite node of T2 in Example 3 to the node B2, the qualification node T2 in Example 3 should be removed since it is a leaf node now. The concept hierarchy enhancement process proceed in Example 3. E-Rule 3 is applied for the leaf node Aj and A2 of T(. As 129 a result, two refer-to links are added from the composite node of Ti (i.e., Foreign-Key) in Example 3 to the node Ai and A2 in Example I, respectively. The node A,, A2, and T, in the concept hierarchy of Example 3 are removed accordingly. The resulted concept graphs after the concept hierarchy enhancement is shown in Figure 4.5. 
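As a rough illustration of how the enhancement rules might be mechanized, the sketch below handles only the simplest, unqualified case, in the spirit of E-Rules 1 and 2: a leaf node that names a non-leaf node of the same concept hierarchy is replaced by a refer-to link from its direct composite node. The node structure extends the hypothetical ConceptNode of the earlier sketch with a refers_to list; the qualified, inter-example cases (E-Rules 3 and 4) and the removal of qualification nodes are omitted here.

```python
# Simplified illustration of E-Rule 1/2-style enhancement; structures are assumed.

def walk(node):
    """Yield node and all of its direct and indirect constituents."""
    yield node
    for child in node.children:
        yield from walk(child)

def enhance_intra_example(root):
    """Replace unqualified leaf references within one concept hierarchy."""
    nonleaf_by_name = {n.name: n for n in walk(root) if n.children}
    for composite in list(walk(root)):
        for leaf in [c for c in composite.children if not c.children]:
            target = nonleaf_by_name.get(leaf.name)
            if target is not None and target is not composite:
                # create a refer-to link from the direct composite node of the
                # leaf to the matching non-leaf node, then drop the leaf itself
                if not hasattr(composite, "refers_to"):
                    composite.refers_to = []
                composite.refers_to.append(target)
                composite.children.remove(leaf)
```

Applied to Example 1, this would replace the leaf nodes A1 and A2 under Primary-Key by refer-to links to the non-leaf nodes A1 and A2 of T1, mirroring the E-Rule 2 discussion above.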
Pnmary-Kcy 7ZS__ZZ3L 7Z5 char(15) not-null A: int \ not-null real not-null ctiar(20) B, \ null char(lS) int Primary-Key Bi \ null char(20) null \ null Foreign-Key : ha5-a : refer-to Figure 4.5: Concept Graphs (Example 1,2 and 3) After Concept Hierarchy Enhancement 4.2.1.3 Concept Graph Merging The goal of concept graph merging is to merge all single-node concept graphs with other concept graphs. A single-node concept graph is defined as a concept graph containing only one node from which some refer-to link(s) is originated. A single-node concept graph exists because the information pertaining to one training example was fragmented 130 and represented in different examples. For example, assume Example 1 given above be represented in two training examples as below. 1,1 Create Table T, (Ai char(15) not-null, A2 char(20) not-null, A3 int null) Example 1.2 Primary-Key Ti (Ai, A^) The concept graphs after the concept hierarchy enhancement for these two training examples are shown in Figure 4.6. As shown on the right of Figure 4.6, the concept graph of Example 1.2 consists of only one node from which two refer-to links are originated: to Ai of Tj and to A2 of T2 in Example 1.1. The concept graph of Example 1.2 can be merged with the concept graph of Example 1.1 and the node Primary-Key will become one of the constituent node of T, in the concept graph of Example 1.1. The result of this merging is shown in Figure 4.7. The concept graph for Example 1.1 and 1.2 after merging is identical to that for Example 1 (i.e., the upper part of Figure 4.5). This is because the unification of Example 1.1 and 1.2 is the same as Example 1. Primary-Key char(I5) not-nuil char(20) int null : has-« • : refer-to Figure 4.6: Concept Graphs (Example 1.1 and 1.2) Before Concept Graph Merging 131 Primary-Key char(15) not-null char(20) not-null int null Figxire 4.7: Concept Graph (Example 1.1 with 1.2) After Concept Graph Merging Based on the illustration given above, the concept graph merging algorithm can be stated as: "if a concept graph has only one node R and the destination nodes of all refer-to links originated from R are all in another concept graph H, then add a has-a link from the root node of H to R." However, this algorithm is not complete because it neglects the situation when R refers to multiple concept graphs. Example 3 shown in Figure 4.5 serves as an explication of this situation. The concept graph for Example 3 has only one node (i.e., Foreign-Key) which refers to two concept graphs. To which concept graph the node Foreign-Key should be merged? The decision can be made based on the roles of the referenced nodes with respect to the node Foreign-Key. Referring to Example 3, the clause T2 (Bj, B2) plays an active role in defining the concept Foreign-Key, while the clause Ti (Ai, A2) plays a passive role. Semantically, the concept graph with one single node R should be merged into the concept graph whose constituent concepts actively participate in defining R. However, without additional domain knowledge, it is difficult to determine the role played by the referenced node. The heuristics of determining into which concept graph the single-node concept graph should be merged is needed. 
Based on the observation that the concepts been referenced by the single-node concept graph 132 with the active role usually appear before those with the passive role (e.g., in Example 3, T2 (Bj, B2) appears before T, (A,, A2)), the heuristics can be stated as; "the concept graph with the single node R should be merged into the concept graph first referenced by R." Accordingly, the concept graph of Example 3 in Figure 4.5 is merged into the concept graph of Example 2. Using the heuristics based on the reference sequence principle, the rules for the concept graph merging can be formalized as follows: For each concept graph. Merging Rule 1 (M-Rule 1): If the concept graph has only one node R and the destination node(s) of the refer-to link(s) originated from R are all in another concept graph H, then add a has-a link from the root node of H to R (i.e., R becomes the constituent node of the root node ofH). M-Rule 2: If the concept graph has only one node R, the destination node(s) of the refer-to link(s) originated from R are in more than one concept graphs, and the concept graph H contains the concept first referenced by R, then add a has-a link from the root node of H to R. Example: As discussed above, the concept graph of Example 3 should be merged into the concept graph of Example 2, according to M-Rule 2. The concept graphs after the concept graph merging step are shown in Figure 4.8. Please note that Example 2 in Figure 4.8 is the unification of the original Example 2 and 3. 133 Bamp»1 Pnmary-Key r:^jzi:^_z3 char(15) not-null cliar(20) B, ^2 a int \ not-null /\ real not-null int Bj ziYZir \ \ char(lS) null null null Primary-Key Foreign-Key char(20) null : has-a : refer-to Figure 4.8: Concept Graphs (Example 1 and 2) After Concept Graph Merging 4.2.1.4 Concept Graph Pruning Concept graph pruning aims at removing spurious and redundant refer-to links and replacing derivable refer-to links with consequential refer-to links. The result of this step is a set of well-structured concept graphs. The rules for concept graph pnming are as follows: For each concept node C of every concept graph. Pruning Rule 1 (P-Rule 1): If any spurious refer-to link originated from C is found, remove this refer-to link. P-Rule 2: 134 If any set of redundant refer-to links originated from C are found, remove these redundant refer-to links. P-Rule 3: If any set of derivable refer-to links originated from C are found, replace them by their consequential refer-to link. Example: Since the concept graphs for Example 1 and 2 (as shown in Figure 4.8) do not contain any spurious and redundant refer-to link, P-Rule 1 and P-Rule 2 are not applied. However, the refer-to links from the node Foreign-Key of Example 2 to Ai and to A2 of Example 1 are derivable because A, and A2 of T, are referenced by Primary-Key of Ti which does not reference to any other node. Therefore, according to P-Rule 3, these two derivable refer-to links will be replaced by a single refer-to link from Foreign-Key of T2 to Primary-Key of Ti. The final concept graphs after the concept graph pruning are shown in Figiure 4.9. As can be proved, each concept graph in Figure 4.9 is a wellstructured concept graph. 
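Returning to the merging step, the following sketch shows one way M-Rules 1 and 2 could be applied, again using the hypothetical node structure of the earlier sketches. The assumption that each node's refers_to list preserves the order in which references appeared in the training example is what implements the "first referenced" heuristic.

```python
# Illustrative sketch of M-Rules 1 and 2; data structures are assumptions.

def owning_root(node, roots):
    """Return the root of the concept graph that contains node."""
    for root in roots:
        if any(n is node for n in walk(root)):
            return root
    return None

def merge_single_node_graphs(roots):
    remaining = []
    for root in roots:
        refers = getattr(root, "refers_to", [])
        if root.children or not refers:
            remaining.append(root)       # not a single-node concept graph
            continue
        # concept graphs referenced by the single node, in order of first reference
        referenced_roots = []
        for target in refers:
            owner = owning_root(target, roots)
            if owner is not None and owner not in referenced_roots:
                referenced_roots.append(owner)
        host = referenced_roots[0]       # M-Rule 1 (one graph) or M-Rule 2 (first referenced)
        host.children.append(root)       # the single node becomes a constituent of the host root
        root.parent = host
    return remaining
```

For Example 3 in Figure 4.5, the Foreign-Key node refers to B1 and B2 of Example 2 before A1 and A2 of Example 1, so the sketch would attach it under T2, as M-Rule 2 prescribes.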
Figure 4.9: Concept Graphs (Example 1 and 2) After Concept Graph Pruning

4.2.2 Concept Generalization Phase

Once all training examples are represented in well-structured concept graphs, the concept generalization phase is initiated. The goal of the concept generalization phase is to generalize each set of "similar" nodes or links into a model construct (in the model schema), which in fact is an instance of a metamodel construct (e.g., attribute, entity-class, association relationship, specialization relationship, or aggregation relationship). Specifically, given the set of well-structured concept graphs generated by the concept decomposition phase, the concept generalization phase generalizes each set of "similar" property nodes into a higher-level property concept which will later become an attribute of entity-classes, each set of "similar" concept nodes into an entity-class, has-a links into aggregation relationships or attributes of entity-classes, and refer-to links into association or specialization relationships in the model schema. These generalization tasks require the following four steps:

1. Generalization of property nodes,
2. Generalization of leaf concept nodes,
3. Generalization of non-leaf concept nodes and their immediate downward has-a links, and
4. Generalization of refer-to links.

One of the challenges of the above-mentioned generalization tasks concerns defining the similarity of property nodes or concept nodes. The similarity measurement for property nodes is based on the abstraction of property values (i.e., the higher-level concept these property values represent). For example, as shown in Figure 4.9, there exist the property nodes "char(15)", "char(20)", "int" and "real", which should be regarded as a set of "similar" property nodes since the higher-level concept these property values represent is "data-type". However, without domain knowledge that defines the higher-level concept for all or part of these property values, it is extremely difficult to assert that these property values should be generalized into the same higher-level concept, even by looking up a thesaurus if one is available. Therefore, the domain knowledge of property generalization hierarchies needs to be employed in the concept generalization phase. On the other hand, the similarity measurement for concept nodes is based on the structural decomposition of these concept nodes. In other words, two concept nodes are similar if their immediate constituent concept nodes and immediate property nodes are the same or largely overlapping. According to this structural decomposition similarity measurement, no domain knowledge is required for identifying a set of "similar" concept nodes from all possible concept nodes. The input and output of each step in the concept generalization phase are shown in Figure 4.10. Each of these steps, the definition of property generalization hierarchies, and the similarity measurements will be presented and illustrated in the following subsections.
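Because the following subsections repeatedly create entity-classes, attributes, and relationship instances in the model schema, it may help to fix one plausible in-memory representation. The sketch below is an assumption introduced only for illustration; the dissertation does not prescribe these particular classes or field names.

```python
# Hypothetical model-schema structures used by the sketches that follow.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Attribute:
    name: str
    null_spec: str = "null"              # "not-null" or "null"
    uniqueness: str = "not-unique"       # "unique" or "not-unique"
    multiplicity: str = "single-valued"  # "single-valued" or "multi-valued"

@dataclass
class EntityClass:
    name: str
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class AggregationRelationship:
    name: str
    aggregate: EntityClass
    # one (component entity-class, minimal cardinality, maximal cardinality) per component
    components: List[Tuple[EntityClass, object, object]] = field(default_factory=list)

@dataclass
class AssociationRelationship:
    name: str
    first: EntityClass
    second: EntityClass
    first_cardinality: Tuple[object, object] = (0, "m")
    second_cardinality: Tuple[object, object] = (0, "m")

@dataclass
class SpecializationRelationship:
    name: str
    superclass: EntityClass
    subclass: EntityClass

@dataclass
class ModelSchema:
    entity_classes: List[EntityClass] = field(default_factory=list)
    relationships: List[object] = field(default_factory=list)
```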
Well-structured Concept Graphs New Generalization of Property Nodes values Property Generalization Hierarchies New property values, and new property generalization hierarchies Generalization of Leaf Concept Nodes Entity-class instances IVIodel Schema Generalization of Non-leaf Concept Nodes and Downward Has-a Links Generalization of Refer-to Links Entity-class instances, aggregation relationship instances, and attributes Association and specialization relationship instances Figure 4.10: High-Level View of Concept Generalization Phase 138 4.2.2.1 Generalization of Property Nodes The purpose of the generalization of property nodes is to generalize each set of "similar" property nodes into a higher-level property concept which will later become an attribute of entity-classes corresponding to the composite nodes of these property nodes. As mentioned, the similarity measurement for the property nodes requires the domain knowledge of property generalization hierarchies. Definition: Property generalization hierarchy A property generalization hierarchy consists of nodes and links. The root node of a property generalization hierarchy is the property concept, while the intermediate nodes and leaf nodes are specific property values of this property concept. A link, linking a child node to a parent node and representing an "is-a" relationship, defines that the parent node (a property concept or a property value) is a generalization of the child node (property value). Example: As shown in Figure 4.11, char(15) and char(20) are generalized into the property value char(n) where n is an integer, while data-type is the property concept of char(n) and int. 139 Data-type y V int char(n) • Data-type = {int, char(n)} • char(n) = {char(15), char(20)} V V char(15) char(20) Figxire 4.11: Example of Property Generalization Hierarchy However, determination of property generalization hierarchies to be included and their degree of completeness can be problematic. Although it is evident that a more complete set of will improve the effectiveness of the generalization of property nodes, it is neither feasible nor practical to enumerate all of the possible property concepts of each data model or all possible property values. Hence, the approach of having a complete set of property generalization hierarchies for each data model or for all data models is not achievable. Rather, an adaptive approach which evolves the initial set of property generalization hierarchies from training examples during the inductive metamodeling process is chosen. Rules which expand existing property generalization hierarchies by adding new property values or new property generalization hierarchies are needed and will be discussed later. Since property nodes will be generalized into attributes of entity-classes in model schemas, it is necessary to examine the dimensions related to the notion of attribute. An attribute, besides its name, usually is described in such dimensions as data types, null specification (whether the attribute allows null values or not), uniqueness specification (whether the values of the attribute are unique or not), and multiplicity (whether the 140 attribute can take more than one value). Hence, four property generalization hierarchies which are data model independent are constructed and employed during the concept generalization phase: data-type, null-spec, uniqueness, and multiplicity. Initial property values of each property generalization hierarchy are listed below: 1. 
data-type = {int, integer, char(n), character(n), text, real, float}
2. null-spec = {null, not null, null allowed, null not allowed}
3. uniqueness = {unique, not unique}
4. multiplicity = {single-valued, multi-valued}

The similarity between a property node i in a concept graph and a property value j in a property generalization hierarchy is defined as:

Similarity(i, j) = 1, if ni = nj
Similarity(i, j) = (L(F(ni, nj)) + L(R(ni, nj))) / ((L(ni) + L(nj)) / 2), if ni ≠ nj

where ni is the name of property node i, nj is the name of property value j, F(s, t) returns the longest common leading substring of s and t, R(s, t) returns the longest common trailing substring of s and t, and L(s) returns the length of string s. This similarity function measures the proportion of characters that two strings have in common, counted by position from both ends of the strings.

Example: The similarity of the property node "char(15)" and the property value "char(n)" is:

F("char(15)", "char(n)") = "char(" => L(F("char(15)", "char(n)")) = 5.
R("char(15)", "char(n)") = ")" => L(R("char(15)", "char(n)")) = 1.
Similarity("char(15)", "char(n)") = (5 + 1) / ((8 + 7) / 2) = 0.8

The similarity of the property node "not-null" and the property value "not null" is:

F("not-null", "not null") = "not" => L(F("not-null", "not null")) = 3.
R("not-null", "not null") = "null" => L(R("not-null", "not null")) = 4.
Similarity("not-null", "not null") = (3 + 4) / ((8 + 8) / 2) = 0.875

Employing the initial data-model-independent property generalization hierarchies and the similarity measurement function, the generalization of property nodes generalizes each property node where possible and updates the property generalization hierarchies where needed. The rules for the generalization of property nodes are defined as follows:

For every property node of each concept graph,

Generalization Rule 1 (G-Rule 1): If the possible-generalization-name of the property node N was set as P (in the concept hierarchy creation step), then set the generalization-name of N as P.

G-Rule 2: If the possible-generalization-name of the property node N is not set and the highest Similarity(N, P), where P is any property value in the property generalization hierarchies, is > h (a threshold whose default is 0.5), then 1) set the generalization-name of N as the name of the root node of the hierarchy containing P, and 2) if Similarity(N, P) < 1, create a new node with the name of N as a child of P in the property generalization hierarchy where P is located.

If neither of these two generalization rules is applicable for a property node, the generalization-name of this property node cannot be determined. Since no other knowledge is available, no action needs to be taken for these un-generalized property nodes. Further generalization of the un-generalized property nodes will be performed in the generalization of non-leaf concept nodes and has-a links step.

Example: The property value with the highest similarity to the property node char(15) of A1 in Example 1 (shown in the upper part of Figure 4.9) is char(n). Since the possible-generalization-name of the property node char(15) was not initialized during the concept hierarchy creation step and Similarity("char(15)", "char(n)") is 0.8 (see the similarity computation above), which is higher than the default threshold, G-Rule 2 is applied.
As a result, the generalization-name of this property node is set to the root node of the property value char(n) (i.e., data-type) and a new property value char(15) is inserted into the data-type property generalization hierarchy as the child of the property value char(n). The generalization process continues for every property node of Example 1 and 2. The concept graphs for Example 1 and 2 after the generalization of property nodes completes is shown in Figure 4.12. 143 T, A, AZ Primary-Key •__Z~V char(15) data-type not-null null-spec A, int data-type cliar(20) not-null data-type null-spcc AI real data-type not-null null-spec _735 int data-type BI ciiar(15) data-type null null-spec null null-spcc Primary-Key cbar(20) data-type null null-spec \ Foreign-Key —: has-a # • : refer-to null null-spec name Figure 4.12: Concept Graphs (Example 1 and 2) After Generalization of Property Nodes 4.2.2.2 Generalization of Leaf Concept Nodes Leaf concept nodes in a concept graph refer to those leaf nodes from which any refer-to link is originated. Foreign-Key ip Figure 4.12 is an example of leaf concept nodes. The presence of associated refer-to links to other concept nodes (leaf or non-leaf) suggests entity-classes be created for leaf concept nodes. Since a leaf concept node is at the leafnode level of a concept graph, it does not have any constituent node beneath. Therefore, no attribute will be created for the entity-class corresponding to the leaf concept node. The rules for the generalization of leaf concept nodes are listed and illustrated below. For every leaf concept node C of each concept graph, G-Rule 3: Leaf Concept Node Entity-Class 144 Set the generalization-name G of the leaf concept node C as its possiblegeneralization-name (if available) or the name of the leaf concept node (i.e., C). If there exists no entity-class in the model schema with the same name as G, then create an entity-class with the name G in the model schema. Example: The leaf concept node of "Primary-Key" of Example 1 (as depicted in Figure 4.12) has two refer-to links to the concept nodes A, of Tj and A2 of Tj, respectively. As illustrated in the concept hierarchy creation step, the possible-generalization-name of the concept node "Primary-Key" is "Primary-Key". According to G-Rule 3, the generalization-name of this leaf concept node is assigned as "Primary-Key" and an entity-class Primary-Key is created in the model schema. The concept graph for Example 2, as shown in Figure 4.12, consists of two leaf concept nodes: Primary-Key and Foreign-Key. Since there already exists an entity-class Primary-Key in the model schema created previously, no entity-class creation action will be performed for the leaf concept node "Primary-Key" of Example 2. Similar to the actions taken for the leaf concept node of "Primary-Key" in Example 1, the generalization-name of this leaf concept node is set to "Foreign-Key" and an entity-class Foreign-Key is created in the model schema. The resulting concept graphs and the model schema after the generalization of leaf concept nodes are shown in Figure 4.13. 
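Returning to the property-node generalization step, the sketch below transcribes the string similarity measure and the essence of G-Rules 1 and 2. Representing each property generalization hierarchy as a flat mapping from its root concept to the set of known property values is a simplification of this sketch; in the method described above a new value is attached as a child of the specific value it matched.

```python
# Sketch of the property-node similarity measure and G-Rules 1-2 (simplified).

def common_prefix_len(s, t):
    n = 0
    while n < min(len(s), len(t)) and s[n] == t[n]:
        n += 1
    return n

def common_suffix_len(s, t):
    n = 0
    while n < min(len(s), len(t)) and s[len(s) - 1 - n] == t[len(t) - 1 - n]:
        n += 1
    return n

def name_similarity(ni, nj):
    """Similarity between a property node name and a property value name."""
    if ni == nj:
        return 1.0
    return (common_prefix_len(ni, nj) + common_suffix_len(ni, nj)) / ((len(ni) + len(nj)) / 2)

# Initial data-model-independent hierarchies, flattened (root concept -> values).
PROPERTY_HIERARCHIES = {
    "data-type": {"int", "integer", "char(n)", "character(n)", "text", "real", "float"},
    "null-spec": {"null", "not null", "null allowed", "null not allowed"},
    "uniqueness": {"unique", "not unique"},
    "multiplicity": {"single-valued", "multi-valued"},
}

def generalize_property_node(name, possible_generalization=None, threshold=0.5):
    """G-Rule 1: keep a possible-generalization-name if one was set;
    G-Rule 2: otherwise pick the best-matching property value above the threshold."""
    if possible_generalization is not None:
        return possible_generalization
    best_score, best_concept = 0.0, None
    for concept, values in PROPERTY_HIERARCHIES.items():
        for value in values:
            score = name_similarity(name, value)
            if score > best_score:
                best_score, best_concept = score, concept
    if best_score > threshold:
        if best_score < 1.0:
            PROPERTY_HIERARCHIES[best_concept].add(name)  # record the new value
        return best_concept
    return None   # the generalization-name cannot be determined

print(name_similarity("char(15)", "char(n)"))    # 0.8
print(name_similarity("not-null", "not null"))   # 0.875
print(generalize_property_node("char(15)"))      # data-type
```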
Figure 4.13: Concept Graphs (Example 1 and 2) and Model Schema After Generalization of Leaf Concept Nodes

4.2.2.3 Generalization of Non-Leaf Concept Nodes and Immediate Downward Has-a Links

Once the generalizations have been performed on all leaf nodes (property nodes or leaf concept nodes) of every concept graph, the generalization process moves up to the non-leaf concept nodes and their immediate downward has-a links. The goals of this generalization step are to identify each set of "similar" non-leaf concept nodes, which will be generalized into an entity-class in the model schema, and to generalize their immediate has-a links into aggregation relationships or attributes of entity-classes. As mentioned, the similarity measurement for non-leaf concept nodes is based on their structural decomposition. Formally stated, the similarity between two non-leaf concept nodes Ni and Nj is defined as:

Similarity(Ni, Nj) = 1, if p(Ni) and p(Nj) are both initialized and p(Ni) = p(Nj)
Similarity(Ni, Nj) = 0, if p(Ni) and p(Nj) are both initialized and p(Ni) ≠ p(Nj)
Similarity(Ni, Nj) = (|Gi ∩ Gj| × |Gj ∩ Gi|) / (|Ci| × |Cj|), otherwise

where p(N) is the possible-generalization-name of the concept node N, Ck is the set of immediate constituent nodes of the concept node Nk, Gk is the set of generalization-names of Ck (with null values removed but without duplicate elimination), S ∩ T is the subset of S which appears in the intersection of S and T (without duplicate elimination), and |S| is the cardinality of the set S.

The choice of this similarity function can be justified as follows. If the possible-generalization-names of two non-leaf concept nodes are identical, regardless of their structural decomposition, they should be regarded as instances of the same model construct; hence, the similarity between them is always 1. On the other hand, if their possible-generalization-names were initialized in the concept hierarchy creation step but are distinct, they are semantically instances of different model constructs; thus, the similarity is always 0. Since Ni and Nj are non-leaf concept nodes, |Ci| and |Cj| > 0. Moreover, 0 ≤ |Gi ∩ Gj| ≤ |Ci| and 0 ≤ |Gj ∩ Gi| ≤ |Cj|. The last portion of the similarity function is therefore mathematically defined in [0, 1], and so is the similarity function as a whole.

Example: Assume the possible-generalization-names of the non-leaf concept nodes A1 and A2 in Example 1 (shown in Figure 4.12) are not set. The similarity between A1 and A2 is computed as follows:

CA1 = {char(15), not-null} => |CA1| = 2.
CA2 = {char(20), not-null} => |CA2| = 2.
GA1 = {data-type, null-spec}, GA2 = {data-type, null-spec} => |GA1 ∩ GA2| = |{data-type, null-spec}| = 2 and |GA2 ∩ GA1| = |{data-type, null-spec}| = 2.

Therefore, Similarity(A1, A2) = (2 × 2) / (2 × 2) = 1, since the generalizations of the constituent nodes of A1 and A2 are identical.

Example: This example demonstrates the situation in which |S ∩ T| ≠ |T ∩ S|. Assume a non-leaf concept node P has constituent nodes p1, p2, p3, and p4, whose generalization-names are g1, g2, g3, and g3, respectively. Another non-leaf concept node Q has three constituent nodes q1 (whose generalization-name is g1), q2 (which has not been generalized yet), and q3 (whose generalization-name is g3).
The Similarity(P, Q) is computed as follows:

CP = {p1, p2, p3, p4} => |CP| = 4.
CQ = {q1, q2, q3} => |CQ| = 3.
GP = {g1, g2, g3, g3}, GQ = {g1, g3} => |GP ∩ GQ| = |{g1, g3, g3}| = 3 and |GQ ∩ GP| = |{g1, g3}| = 2.

Hence, Similarity(P, Q) = (3 × 2) / (4 × 3) = 0.5 < 1, since the generalization of the constituent nodes of P is not identical to that of Q.

Given a set of non-leaf concept nodes and the similarity between any two nodes, defining a set of similar non-leaf concept nodes with respect to the similarity threshold η is essential. Assume C1, C2, ..., Cn are non-leaf concept nodes. A similarity greater than or equal to η between any two nodes is highlighted by a solid line, while similarities less than η are graphically represented as dashed lines, as shown in Figure 4.14 a). Based on this graphical notation, the determination of sets of similar non-leaf concept nodes essentially partitions the well-connected similarity graph into subgraphs, each of which contains nodes reachable (via solid lines) directly or indirectly from any other node in the subgraph and unreachable (via solid lines) from any node in other subgraphs. Using this definition, the set of non-leaf concept nodes in Figure 4.14 a) is partitioned into two sets of similar non-leaf concept nodes: {C1, C2, C3, C4, C6} and {C5}, shown in Figure 4.14 b).

Figure 4.14: Example of Sets of Similar Non-Leaf Concept Nodes (a) Similarity between Any Two Non-Leaf Concept Nodes; b) Two Sets of Similar Non-Leaf Concept Nodes)

It is obvious that the set of non-leaf concept nodes which are similar to a particular node Ci with respect to η is the set of nodes in the well-connected similarity graph reachable via solid lines directly or indirectly from Ci. Formally, the set of non-leaf concept nodes S similar to Ci can be defined recursively as:

S = {Cj | Similarity(Cj, Ci) ≥ η, Cj ∈ U, Cj ≠ Ci} ∪ {Ck | Similarity(Ck, Cj) ≥ η, Ck ∈ U, Cj ∈ S, Ck ≠ Cj}

where U is the universe (i.e., the set of all non-leaf concept nodes). Given the functions for computing the similarities of non-leaf concept nodes and for defining sets of similar non-leaf concept nodes, the generalization of non-leaf concept nodes and immediate has-a links follows the algorithm described in Algorithm 4.1.

/* Input: All concept graphs G, model schema M, and property generalization hierarchies P
   Results: M with new entity-classes and aggregation relationships, G with the generalization-name of non-leaf concept nodes initialized, and P with new property generalization hierarchies and/or new property values added into existing property generalization hierarchies. */
Begin
Add all non-leaf concept nodes in G into U.
Sort U in accordance with the heights of the non-leaf concept nodes in descending order, so that the generalization of non-leaf concept nodes and has-a links is performed in a bottom-up manner.
Repeat
> Construct the well-connected similarity graph from U.
> Get the first non-leaf concept node Ci in U.
> Find S as the union of Ci and the set of non-leaf concept nodes similar to Ci.
> Apply the generalization rules G-Rule 4 to 10 (defined below) on S.
> U = U - S /* remove all of the non-leaf concept nodes in S from U */
Until U is empty.
End.
Algorithm 4.1: Generalization of Non-Leaf Concept Nodes and Downward Has-a Links For a set of similar non-leaf concept nodes S, the following generalization rules can be applied: G-Rule 4: Set of Similar Non-leaf Concept Nodes -> Entity-Class Create an entity-class E in the model schema for S. If a possible-generalization-name G exists in S, set the name of E as G; otherwise, set the name of E as the default "construct_#" where # starts at 1 and increments by 1 every time it is employed. Set the generalization-name of each non-leaf concept node in S as the name of E. 151 G-Rule 5: Identifier of Entity-Class If |S| > 1, add an attribute (default name is "name") into E as the identifier of E. This attribute is used to distinguish the non-leaf concept nodes in S from each other. The null-specification, uniqueness and multiplicity properties of this attribute are set to "not-null", "unique" and "single-valued", respectively. G-Rule 6: Identifier of Entity-Class If |S| = 1 and the immediate property nodes of the only non-leaf concept node N in S contain one or more uninitialized generalization-name, add an attribute (default name is "name") into E as the identifier of E. This attribute is used to distinguish the property-nodes of N which may be instances of N from each other. Same as in GRule 5, the null-specification, uniqueness and multiplicity properties of this attribute are set to "not-null", "unique" and "single-valued", respectively. G-Rule 7: Generalized Property Node Attribute of Entity-Class For each generalized property node P (i.e., the generalization-name of P has been set) of every non-leaf concept node in S, If there exists no attribute (in E where E is the entity-class for S) whose name is die same as the generalization-name of P, add a new attribute A (whose name is the same as the generalization-name of P) into E in the same order as that of property nodes in all non-leaf concept nodes in S. G-Rule 8: Ungeneralized Property Node —> Attribute of Entity-Class For each ungeneralized property node P (i.e., the generalization-name of P has not been set) of every non-leaf concept node in S, 152 If all of the sibling property nodes of P are ungeneralized and the number of attributes of E (where E is the entity-class for S) is the same as the number of immediate property nodes of S, 1) set the generalization-name of P as the name of the attribute of E whose position in the attribute list of E is the same as P's position in the property node list of S, and 2) create a new property value (with the same name as P) whose parent node is a property concept which has the same name as the generalization-name of P. If all of the sibling property nodes of P are ungeneralized and the number of attributes of E (where E is the entity-class for S) is different from the number of immediate property nodes of S, 1) create a new property generalization hierarchy where the name of its root node (i.e., property concept) is set as "property_#" where # starts at 1 and increments by 1 every time a new property generalization hierarchy is created, 2) add a new property value (with the same name as P) as the child node of the root node of the newly created property generalization hierarchy, 3) set the generalization-name of P as the same name as the root node of the newly created property generalization hierarchy, and 4) add a new attribute A (whose name is the same as the generalization-name of P) into E in the same order as that of property nodes in all non-leaf concept nodes in S. 
153 If one of the left sibling property nodes of P has been generalized, 1) set L as the nearest generalized left sibling property node of P, 2) find the attribute A (from the attribute list of E) which is fth attribute right to the attribute for L where / is the distance between P and L (i.e., 1 + the number of property nodes of S between P and L), 3) set the generalization-name of P as the same name as A, and 4) create a new property value (with the same name as P) whose parent node is a property concept which has the same name as A. If none of the left sibling property nodes of P has been generalized and one of the right sibling property nodes of P has been generalized, 1) set L as the nearest generalized right sibling property node of P, 2) find the attribute A (from the attribute list of E) which is fth attribute left to the attribute for L where / is the distance between P and L (i.e., I + the number of property nodes of S between P and L), 3) set the generalization-name of P as the same name as A, and 4) create a new property value (with the same name as P) whose parent node is a property concept which has the same name as A. G-Rule 9: Properties of Attributes of Entity-Class For each attribute A, 1. If every non-leaf concept node in S has at least one property node corresponding to A, set the null-specification property of A to "not-null", and "null" otherwise. 2. If the null-specification property of A is "not-null" and if the names of all property nodes corresponding to A of all non-leaf concept nodes in S are all 154 different, set the uniqueness property of A to "unique", and "not-unique" otherwise. 3. If every non-leaf concept node in S has at most one property node corresponding to A, set the multiplicity property of A to "single-valued", and "multi-valued" otherwise. G-Rule 10: Immediate Downward Has-a Links —> Aggregation Relationship If any of the non-leaf concept nodes in S has constituent concept node(s) (leaf or non-leaf), 1. Create an aggregation relationship R in the model schema between E and the entity-classes {Ei, E2, ..., En} which are the union of the entity-classes corresponding to all constituent concept nodes of all non-leaf concept nodes in S. 2. The default name of R is "consist_of_#" where # starts at 1 and increments every time when a new aggregation-relationship is created. 3. The entity-classes Ei, E2, is connected to R as component-classes, while E is the aggregate-class of R. 4. For each Ej related to R: The minimal cardinality of Ej with respect to E is the minimum of the numbers of constituent concept nodes (whose generalization-name is Ej) of all non-leaf concept nodes in S. The maximal cardinality of Ej with respect to E is "m" (for many) if the number N of the constituent concept nodes (whose generalizationname is Ej) of every non-leaf concept node in S is not identical; otherwise, the maximal cardinality of Ej with respect to E is set as the number N. 155 Example: Referring to Example 1 and 2 in Figure 4.13, U after sorting by height is {T,.Ai, T1.A2, T,.A3, T2.A,, T2.A2, T2.B1, T2.B2, T,, T2}. The prefix by the name of the composite node (e.g., Tj) of a non-leaf concept node (e.g., Ai) is only for the illustrative and representational clarity and convenience. The well-connected similarity graph for U is shown in Figure 4.15 (similarity links with zero values are not shown in the diagram). 
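Before turning to the worked example, the following sketch pulls together the structural similarity measure, the recursive definition of a set of similar nodes, and the outer loop of Algorithm 4.1. G-Rules 4 through 10 are represented by a placeholder callback, and node attributes such as children, parent, possible_generalization, and generalization follow the hypothetical structures assumed in the earlier sketches rather than the dissertation's own implementation.

```python
# Illustrative sketch of the structural similarity measure and Algorithm 4.1's loop.

def intersect_keep_left(a, b):
    """S intersect T as defined above: elements of a (duplicates kept) that occur in b."""
    b_set = set(b)
    return [x for x in a if x in b_set]

def structural_similarity(ni, nj):
    pi, pj = ni.possible_generalization, nj.possible_generalization
    if pi is not None and pj is not None:
        return 1.0 if pi == pj else 0.0
    gi = [getattr(c, "generalization", None) for c in ni.children]
    gj = [getattr(c, "generalization", None) for c in nj.children]
    gi = [g for g in gi if g is not None]      # null removal, duplicates kept
    gj = [g for g in gj if g is not None]
    numerator = len(intersect_keep_left(gi, gj)) * len(intersect_keep_left(gj, gi))
    return numerator / (len(ni.children) * len(nj.children))

def similar_set(seed, universe, eta):
    """Seed plus every node transitively reachable via similarity >= eta."""
    group, frontier = {seed}, [seed]
    while frontier:
        current = frontier.pop()
        for other in universe:
            if other not in group and structural_similarity(other, current) >= eta:
                group.add(other)
                frontier.append(other)
    return group

def depth(node):
    d = 0
    while node.parent is not None:
        node, d = node.parent, d + 1
    return d

def generalize_nonleaf_nodes(graph_roots, apply_grules_4_to_10, eta=0.2):
    universe = [n for root in graph_roots for n in walk(root) if n.children]
    universe.sort(key=depth, reverse=True)    # deeper nodes first, i.e., bottom-up
    remaining = list(universe)
    while remaining:
        group = similar_set(remaining[0], remaining, eta)
        apply_grules_4_to_10(group)           # G-Rules 4-10: entity-classes, attributes, ...
        remaining = [n for n in remaining if n not in group]
```

The default threshold of 0.2 here simply follows the value assumed in the example below; it is not claimed to be the method's recommended setting.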
It is obvious that the similarity between any pair of nodes in the upper part of Figure 4.15 is 1 because the generalizations of their constituent nodes are identical (i.e., data-type and null-spec). That is, Similarity(Ti, T2) = 1 since their possible-generalization-names are the same. T,.A2 T,.B ; similarity = 1 T| T, Figure 4.15: Similarity Graph for All Non-Leaf Concept Nodes of Example 1 and 2 Assume the similarity threshold r\ be 0.2. S (the set of non-leaf concept nodes similar to the first non-leaf concept node Tj.A,) includes T,.Ai, T[.A2, T[.A3, Tj.A,, T2.A2, T2.B1, and T2.B2 as shown in Figure 4.15. According to G-Rule 4 and 5, an entity-class construct 1 with the name attribute as its identifier is created in the model schema and 156 the generalization-name of each non-leaf concept node in S is set as "construct_r'. GRule 7 adds two attributes into the entity-class construct_l: data-type and null-spec. According to G-Rule 9, the null-specification (and multiplicity) property of the data-type attribute of construct_l is set to "not-null" (and "smgle-valued) because the data-type property node appears at least once (and at most once) as the constituent nodes of each non-leaf concept node in S. The uniqueness property of the data-type attribute is set to "not-unique" since some of the data-type property nodes of the non-leaf concept nodes in S share the same name (e.g., char(15) is shared by T,.A[ and T2.B1). Similarly, the nullspecification, uniqueness, and multiplicity property of the null-spec attribute of construct_l can then be determined by applying G-Rule 9. Since none of the non-leaf concept nodes in S has constituent concept node, G-Rule 10 is not applicable. Subsequently, as depicted in Algorithm 4.1, U = U - S is executed and the next generalization iteration starts. Now, U contains only Ti and T2 and Similarity(T|, T2) again is 1 since the possible-generalization-names of Tj and T2 are the same. Since Similarity(Ti, T2) is greater than the similarity threshold, S = {Tj, T2}. According to GRule 4 and 5, an entity-class called "Table" (i.e., the possible-generalization-name of T, and T2) with the name attribute as its identifier is created in the model schema and the generalization-name of T, and T2 is set as "Table". Because neither T, nor T2 has an immediate constituent property node, G-Rule 7 can not be applied. On the other hand, T1 has four immediate constituent concept nodes which correspond to the entity-classes construct_l and Primary-Key, at the same time the six immediate constituent concept nodes of T2 correspond to the entity-classes construct_l, Primary-Key, and Foreign-Key. 157 Thus, G-Rule 10 is applied; resulting in the creation of an aggregation relationship consist_of_l linking the entity-class Table (as the aggregate-class) to the entity-classes construct_l, Primary-Key, and Foreign-Key (as the component-classes) in the model schema. The minimal cardinality of construct_l with respect to Table is 3 which is the minimum of 3 (T, has three immediate constituent concept nodes A[, A2 and A3 which are generalized into construct_l) and 4 (Ai, A2, Bj, and 83 of T2). The maximal cardinality of construct_l with respect to Table is "m" because Tj and T2 have different number of immediate constituent concept nodes corresponding to construct_l (3 and 4, respectively). The minimal and maximal cardinality of Primary-Key and Foreign-Key are determined likewise. 
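The cardinality computation just described (step 4 of G-Rule 10) can be summarized in a few lines. The sketch below assumes the same hypothetical node attributes as before and returns the (minimal, maximal) pair for one component entity-class of the aggregation relationship.

```python
# Sketch of the cardinality computation in step 4 of G-Rule 10 (assumed structures).

def aggregation_cardinalities(nodes_in_s, component_class_name):
    counts = [sum(1 for child in node.children
                  if getattr(child, "generalization", None) == component_class_name)
              for node in nodes_in_s]
    minimal = min(counts)
    maximal = counts[0] if len(set(counts)) == 1 else "m"
    return minimal, maximal

# For Example 1 and 2: T1 has 3 constituents generalized into construct_1 and T2
# has 4, so the cardinality of construct_1 with respect to Table is (3, "m").
```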
After the determination of cardinalities for component-classes of consist_of_l, T] and T2 are removed from U which now becomes an empty set. The generalization of non-leaf concept nodes and has-a links is considered completed. The resulting concept graphs (for Example 1 and 2) and the model schema are shown in Figure 4.16. 158 T, Table Bampiej: Aj coiistract_l Ai coiistnict_l cliar(15) data-type Bammtg A, constnict_l int data-type \ \ z::x not-null nnll-spcc not-null char(20) null-spcc data-type Primary-Key Primary-Key coiistnict_l int data-type null null-spec Table A2 constnict_l real data-type not-null null-spec \ B, constnict_l cbar(lS) data-type \ construct^l char(20) data-type null null-spec null null-spec Primary-Key Primary-Key Foreign-Key Foreign-Key ; has-a : refer-to null null-spec name gcncnlizatiOD Model Schema CT nanit name^^^ r (3,m) Construct_l Table (1.1) Primary-Key (0. 1) Foreign-Key null-spcc Table, name (not-null, unique, single-valued) Constnict_I. name (not-null, unique, single-valued) Construct_l. data-type (not-null, not-unique, single-valued) Construct_l. null-spec (not-null, not-unique, single-valued) Figure 4.16: Concept Graphs (Example 1 and 2) and Model Schema After Generalization of Non-Leaf Concept Nodes and Has-a Links 4.2.2.4 Generalization of Refer-to Links By now, the remaining ungeneralized parts in concept graphs are refer-to links. As shown Figure 4.16, Primary-Key has refer-to links to A, and to A2 of T,. At the meta 159 level, these two refer-to links signify that the model construct Primary-Key relates to the model construct construct_l; hence, an association relationship should be constructed between them in the model schema. Semantically, refer-to links denote association or specialization relationships between the entity-classes created in the previous steps. Whether a refer-to link is generalized into an association or specialization relationship depends on whether the entity-classes corresponding to the origin and to the destination concept nodes of the refer-to link are the same or different. If a refer-to link connects concept nodes corresponding to different entity-classes, as illustrated previously, it will be generalized into an association relationship because a model construct (i.e., entity-class instance in the model schema) usually is not a superclass or subclass of a different model construct. On the contrary, if a refer-to link connects concept nodes corresponding to the same entity-class, the origin concept node usually is the subclass of the destination concept node. In an 00 data model, for example, a class C is specified as a subclass of another class S. When constructing concept graphs for this example, the concept node for C refers to the concept node for S. Since C and S are instantiations of the same model construct, their corresponding entity-class E should be the same. The refer-to link from C to S denotes a specialization relationship in which E for the origin concept node C is the subclass and E for the destination concept node S is the superclass. Furthermore, if a concept node (whose corresponding entity-class is C) contains a set of refer-to links R whose destination concept nodes correspond to the same entity-class E in 160 the model schema, a decision needs to be made regarding whether R be generalized into one or more relationships in the model schema between C and E. 
The answer is contingent first on the type of relationship into which R should be generalized and secondly on whether all of the destination concept nodes of R are in the same or different concept graph. If specialization relationship is an appropriate type for R (i.e., C is the same as E), only one specialization relationship should be created for R, because all specialization relationships imply identical semantic meaning and interpretation (superclass-and-subclass). If there are more than one specialization relationships between two entity-classes, only one of them is significant and the rest of them are redundant and therefore should not exist. Unlike the decision on the number of specialization relationships for R, the semantical correctness of permissible coexistence of multiple association relationships with distinct meanings between two entity-classes requires additional consideration for the decision on the number of association relationships for R. Different concept graphs referenced by a concept node assume different roles in defining this concept node, as discussed in Section 4.2.1.3. Thus, in spite of all destination concept nodes of R corresponding to the same entity-class, the subset of R which ends at one concept graph should not be considered the same as another subset of R ending at different concept graphs. Accordingly, when association relationship is the type for R (i.e., C differs from E), if all of the destination concept nodes of R are in the same concept graph, one association relationship is sufficient for generalizing R; otherwise, the number of association relationships for R is the number of different concept graphs being referenced by R. 161 Since the coexistence of multiple association relationships with distinct meanings between two entity-classes is unusual in model schemas, the above-mentioned consideration for the decision on the number of association relationships for R will not be taken into account in the following rules for the generalization of refer-to links. Accordingly, the process of and rules for the generalization of refer-to links is defined as follows: For each refer-to link, G-Rule 11: Refer-to Link Association Relationship If the entity-classes Ei and E2 corresponding to the origin and the destination concept nodes of the refer-to link L are different and if there exists no association relationship whose first associated-class is Ej and second associated-class is E2 in the model schema, then 1. Create an association relationship R in the model schema for L. The default name for R is "relate_#" where # starts at 1 and increments by 1 every time an association relationship is created in the model schema. 2. R cormects Ei (the first associated-class of R) with E2 (the second associatedclass of R). 3. The minimal cardinality of E2 on R is the minimum number of the concept nodes of E2 (i.e., those concept nodes whose generalization-names are E2) referred by each concept node of E, of every concept graph. The minimal cardinality of E, on R is the minimum number of the concept nodes of Ei referring to each concept node of E2 on every concept graph. 162 4. The maximal cardinality of E2 on R is "m" (for many) if the number N of the concept nodes of E2 referred by each concept node of E| of every concept graph is not identical; otherwise, the maximal cardinality of E2 on R is set to the number N. 
Similarly, the maximal cardinality of E[ on R is "m" if the number N of the concept nodes of E, referring to each concept node of Ej of every concept graph is not identical; otherwise, the maximal cardinality of E, on R is set as N. G-Rule 12: Refer-to Link -> Specialization Relationship If the entity-classes corresponding to the origin and the destination concept nodes of the refer-to link L are the same (say E) and if there exists no specialization relationship whose superclass and subclass are E in the model schema, then 1. Create a specialization relationship S in the model schema for L. The default name for S is "specialize_#" where # starts at 1 and increments by 1 every time a specialization relationship is created in the model schema. 2. S connects E (as the superclass) with itself (as the subclass). 3. The minimal cardinality of the subclass of S is the minimum number of the concept nodes of E referring to each concept node of E of every concept graph. 4. The maximal cardinality of the subclass of S is "m" if the number of the concept nodes of E referring to each concept node of E of every concept graph is not identical; otherwise, the maximal cardinality of the subclass of S is set as the number N. Example: 163 Assume the first refer-to link in the concept graphs for Example 1 and 2 (as shown in Figure 4.16) is from Primary-Key to Ai. Since the entity-classes corresponding to the origin and to the destination concept nodes of this refer-to link are Primary-Key and construct_l which are different, G-Rule 11 is applied. As a result, an association relationship relate_l which connects Primary-Key with construct_l is created in the model schema. In Example 1, the nimiber of the concept nodes of construct_l referred by the concept node of Primary-Key is I, while it is 2 in Example 2. Thus, the minimal and the maximal cardinality of construct_l on relate_l are 1 and "m", respectively. On the other hand, since not all of the concept nodes of construct_l are referred by the concept node Primary-Key in both examples, the minimal cardinality of Primary-Key on relate_l is 0. Moreover, at most one refer-to link from a concept node of Primary Key to a concept node of construct_l in both examples; thus, the maximal cardinality of Primary-Key on relate_l is 1. After the association relationship relate_l is created for the refer-to link from the concept node of Primary-Key to the concept node of construct_l, no action needs to be taken for the refer-to links from Primary-Key to A2 (in Example 1) and from Primary-Key to Aj (in Example 2). Since the concept node Foreign-Key (in Example 2) has three refer-to links: two to the concept nodes of construct_l and one to the concept node of PrimaryKey. Therefore, two association-relationships relate_2 (connecting Foreign_Key with construct_l) and relate_3 (connecting Foreign_Key with Primary_Key) will be created in the model schema. The minimal and maximal cardinalities of each of the two association 164 relationships can be determined using G-Rule 11. The resulting model schema after the generalization of refer-to links is shown in Figure 4.17. IV^etSdieiiia' Table 'consist of 1 Construct 1 (2.2) (0.1) (1.1) (3.m) (I.m) (0.1) Primary-Key (1.1) relate 3 relate 1 (0.1) relate 2 Table, name (not-null, unique, single-valued) Constnict_I. name (not-null, unique, single-valued) Construa_l. data-type (not-null, not-unique, single-valued) Constnict_l. 
null-spec (not-null, not-unique, single-valued) Figure 4.17: Model Schema After Generalization of Refer-to Links The generalization of refer-to links step concludes the concept generalization phase. In other words, induction and abstraction of model constructs from the training examples are achieved. By substituting the entity-class Construct_l in Figure 4.17 with a semantically meaningfiil name such as "Attribute", the induced model constructs demonstrate the key concepts of the relational data model and can be explained as follows: A table in the relational data model consists of three or more attributes (i.e., Construct_l in the diagram), one and only one primary key, and zero or one foreignkey. Each primary key is defined on (i.e., relate_l) one or many attributes, but an attribute can be associated with at most one primary key. On the other hand, a foreign key refers to (i.e., relate_3) one and only one primary key, while a primary key is referenced by zero or one foreign key. Finally, a foreign key composes of (i.e., relate_2) two attributes, and an attribute can be included in at most one foreign key. 165 The induced model constructs and the relationships among them conform to the general definition of the relational data model. However, some cardinalities are problematic. For example, the minimal cardinality of the component-class Attribute on consist_of_l and the maximal cardinality of the component-class Foreign-Key on consist_of_l are not conect with respect to the general definition of the relational data model, but are accurate with respect to the training examples employed. Such incorrect cardinalities are resulted firom the representativeness of training examples and can be fixed when more representative training examples are included in the induction process. 4.2.3 Constraint Generation Phase This phase is to generate explicit model constraints and implicit model constraints pertaining to the model constructs induced from training examples. Two major types of explicit model constraints are considered in this phase: domain and inclusion model constraints. A domain constraint specifies the possible values an attribute can have [EN94], while an inclusion constraint specifies an inclusion restriction between an association relationship connecting component-classes in an aggregation relationship and this aggregation relationship. Take a relational data model for example, attribute(s) serving as the primary key of a table are the attribute(s) of the same table (i.e., the primary key attributes of a table is a subset of the attributes of the table on which the primary key is defined). Although domain and inclusion model constraints are not inherent to the metamodel constructs, training examples from which the model constructs are generalized afford the inductive possibility. On the other hand, implicit model constraints are instantiations of the metamodel semantics triggered by the instantiations 166 of metamodel constructs into model constructs (as depicted in Figure 3.5). The algorithm for instantiating from the metamodel semantics into implicit model constraints has been specified in Algorithm 3.1. During the generalization of property nodes step, attributes of entity-classes in the model schema are generalized from property nodes in concept graphs, and the property generalization hierarchies are updated accordingly. 
Therefore, the possible values an attribute can take are all leaf nodes of the property generalization hierarchy to which the attribute (i.e., all property nodes from which the attribute is generalized) corresponds (i.e., the root node of the property generalization hierarchy is the same as the name of the attribute). However, the incorporation of all leaf nodes of a property generalization hierarchy into a domain constraint may cause the problem of over-specialized. For example, assume an attribute correspond to the property generalization hierarchy of data­ type and some of its leaf nodes include char(5), char(lO), char(15), etc. If all of the leaf nodes in the property generalization hierarchy are included in the domain constraint of this attribute, one will see many values for character strings of different length (e.g., char(5), char(lO), char(15), etc.) in the domain constraint. If char(5), char(lO), char(15), etc. can be substituted by the property value char(n) where n > 0, the over-specialized problem can be alleviated. A solution towards to this end is as follows. If a set of leaf nodes in a property generalization hierarchy share the same non-root parent node, the non-root parent node (in replacement of the set of child leaf nodes) will be included in a domain constraint. For those leaf nodes whose parent nodes are the root nodes of 167 property generalization hierarchies, further generalization possibility does not exist; thus, they will appear in domain constraints as they are. Inclusion constraints can be induced from training examples by checking whether all refer-to links from which an association relationship is generalized are all intra-concept graph refer-to links. If so, an inclusion constraint will be created on the entity-class which is the lowest aggregate-class, in the aggregation hierarchy, of the entity-classes related by the association relationship. The discussion of generic form of inclusion constraints will be deferred until the Constraint Generalization Rule 2 (CG-Rule 2). Accordingly, the constraint generalization phase includes explicit model constraint induction and inherent model constraint instantiation. Its algorithm is summarized in Algorithm 4.2, followed by the constraint generation rules employed in the algorithm. 168 /* Input: Model schema M, metamodel semantics S and property generalization hierarchies P. Result: M with explicit and implicit model constraints */ Begin For each model construct E of M > For each attribute A of E Apply constraint generation rule CG-Rule 1 on A. > If E is an entity-class For each metamodel semantics T in the metamodel construct Class of S Instantiate(T, E). /* see Algorithm 3.1 */ > If E is an association relationship Apply constraint generation rule CG-Rule 2 on E. For each metamodel semantics T in the metamodel construct Association of S Instantiate(T, E). > If E is an aggregation relationship For each metamodel semantics T in the metamodel construct Aggregation of S Instantiate(T, E). > IfE is a specialization relationship For each metamodel semantics T in the metamodel construct Specialization of S Instantiate(T, E). End. 
Algorithm 4.2: Constraint Generalization Process Constraint Generation Rule 1 (CG-Rule 1): Property Generalization Hierarchy -> Domain Constraint If the attribute A of the model construct E is not the name attribute, create the following domain constraint in E: Va: e_name => a.a_name e value-set where e_name is the name of E, a_name is the name of A, and 169 value-set is the collection of the leaf nodes whose parents are the root node of the property generalization hierarchy to which A belongs and the immediate parents of the leaf nodes whose parents are not the root node of the property generalization hierarchy. CG-Rule 2: Intra-structural Concept Graph Refer-to Links -> Inclusion Constraint If all refer-to links from which the association relationship R is generalized (R connects the entity-classes S to T) are intra-concept graph refer-to links, create the following inclusion constraint in the lowest common aggregate-class C of S and T; Vc: cname, Ve: c.sname, A = e.mame.tname => Ac c.T where cname is the name of entity-class C, sname is the name of the entity-class S, tname is the name of the entity-class T, and mame is the name of the association relationship R. Example: Since two non-name attributes exist m the entity-class construct_l, two domain constraints are created and encapsulated in the entity-class construct_l (according to CGRule 1). Va: construct_l => a.data-type 6 {int, char(n), real} Va: construct_l => a.null-spec 6 {not null, null} An examination of the concept graphs for Example 1 and 2 (as shown in the upper portion of Figure 4.16) indicates that all of the refer-to links originated from PrimaryKey are intra-concept graph refer-to links. According to CG-Rule 2, an inclusion constraint (as shown below) is generated and encapsulated in the entity-class Table which is the lowest common aggregate-class of the entity-classes for the origin and destination concept nodes of these refer-to links (i.e., Primary-Key and construct_l in the 170 model schema). Analogously, the intra-concept graph refer-to links from Foreign Key to BI of T2 and B2 of T2 also suggests an inclusion constraint be created in the entityclass Table. These two inclusion constraints are listed in the following: Vc: Table, Ve: c.Primary-Key, A = e.relate_l.construct_l => A c c.construct_l Vc: Table, Ve: c.Foreign-Key, A = e.relate_2.construct_l A c c.construct_l By applying Algorithm 3.1, nineteen inherent model constraints are instantiated from the implicit metamodel constraints of the metamodel schema. For example, the inherent model constraints 5) and 6) in Table 4.3 state that for the names of tables must be unique and each table needs to have a table name. The explicit model constraints induced from examples and the inherent model constraints instantiated from the implicit metamodel constraints in the constraint generation phase are listed in Table 4.3. 171 Domain Constraints Inclusion Constraints Inherent Model Constraints 1. V a:construct_l =:> a.data-type e {int, char(n), real} 2. V a:construct_l => a.null-spec e (null, not-null} 1. V c:Table, V e:c.Primary-Key, A=e.reIate_l.construct_l => a c c.construct_l 2. V crTable, V e:c.Foreign-Key, A=e.reIate_2.construct_I A c c.construct_l 1. V ol:construct_l, V o2:construct_l, ol ?io2 => 01.name * o2.name 2. V oI:construct_I =:> count(ol.name) = I 3. V ol:construct_l => count(ol.data-type) = 1 4. V ol:construct_l => count(oI.null-spec) = I 5. V o I :Table, V o2:Table, o I o2 => o I .name o2.nanie 6. V ohTable => count(ol.name) = 1 7. 
∀ o1:Table => count(o1->consist_of_1->Construct_1) ≥ 3
8. ∀ o1:Table => count(o1->consist_of_1->Primary-Key) = 1
9. ∀ o1:Table => count(o1->consist_of_1->Foreign-Key) ≥ 0
10. ∀ o1:Table => count(o1->consist_of_1->Foreign-Key) ≤ 1
11. ∀ o1:Primary-Key => count(o1->relate_1->Construct_1) ≥ 1
12. ∀ o1:Construct_1 => count(o1->relate_1->Primary-Key) ≥ 0
13. ∀ o1:Construct_1 => count(o1->relate_1->Primary-Key) ≤ 1
14. ∀ o1:Foreign-Key => count(o1->relate_2->Construct_1) = 2
15. ∀ o1:Construct_1 => count(o1->relate_2->Foreign-Key) ≥ 0
16. ∀ o1:Construct_1 => count(o1->relate_2->Foreign-Key) ≤ 1
17. ∀ o1:Foreign-Key => count(o1->relate_3->Primary-Key) = 1
18. ∀ o1:Primary-Key => count(o1->relate_3->Foreign-Key) ≥ 0
19. ∀ o1:Primary-Key => count(o1->relate_3->Foreign-Key) ≤ 1

Table 4.3: Model Constraints (Examples 1 and 2) After Constraint Generation

4.3 Time Complexity Analysis of Abstraction Induction Technique
The time complexity of each step in Abstraction Induction Technique for inductive metamodeling is determined by a number of factors, including 1) the number of training examples, 2) the number of property nodes on a concept graph, 3) the number of leaf concept nodes on a concept graph, 4) the number of non-leaf concept nodes on a concept graph, 5) the number of refer-to links on a concept node, 6) the number of property hierarchies, 7) the number of property values on a property hierarchy, 8) the number of model constructs in the model schema being induced, 9) the number of relationships among model constructs, and 10) the number of attributes of a model construct. Incorporating all of the above-mentioned factors in the time complexity analysis would bury the most important factors in complicated order-of-magnitude functions. Thus, only the most important factors are considered in this time complexity analysis: the number of training examples (n), the number of model constructs in the model schema being induced (m), and the number of relationships among model constructs (r). The time complexity of Abstraction Induction Technique is summarized in Table 4.4. The order of magnitude of the concept hierarchy enhancement step is O(n²) because all concept hierarchies (non-leaf nodes) must be examined for every leaf node in each concept hierarchy to detect potential reference relationships. The need to construct a well-connected similarity graph for all non-leaf concept nodes of all concept graphs results in the generalization of non-leaf concept nodes also being O(n²). The orders of magnitude of the remaining steps in Abstraction Induction Technique are obvious and will not be explained further.

Phase                   | Step                                      | Order of Magnitude
Concept Decomposition   | Concept hierarchy creation                | O(n)
Concept Decomposition   | Concept hierarchy enhancement             | O(n²)
Concept Decomposition   | Concept graph merging                     | O(n)
Concept Decomposition   | Concept graph pruning                     | O(n)
Concept Generalization  | Generalization of property nodes          | O(n)
Concept Generalization  | Generalization of leaf concept nodes      | O(nm)
Concept Generalization  | Generalization of non-leaf concept nodes  | O(n²)
Concept Generalization  | Generalization of refer-to links          | O(nr)
Constraint Generation   | —                                         | O(m+r)

Table 4.4: Time Complexity of Each Step in Abstraction Induction Technique

The overall time complexity of Abstraction Induction Technique is O(nm + n² + nr). If the number of training examples is the factor of greatest concern, the overall time complexity of Abstraction Induction Technique is O(n²).

4.4 Evaluation of Abstraction Induction Technique
In order to evaluate Abstraction Induction Technique for inductive metamodeling, a prototype of this technique has been implemented.
Three evaluation studies will be conducted in this section. The model schema induced in each study will be compared with a reference model schema corresponding to the data model employed in the study. The reference model schema conforms to the general definition of the data model and is engineered by an expert model engineer. An inconsistency between these two model schemas can be classified into one of the following:
1. Incorrect: A specification in the induced model schema is identified in the reference model schema, but its details are not identical to those of its counterpart in the reference model schema.
2. Missing: A specification in the reference model schema cannot be found in the induced model schema.
3. Excessive: A specification in the induced model schema cannot be found in the reference model schema.

Evaluation Study 1: Inducing Relational Model Schema from a University Health-Center Database
The first evaluation study is based on a relational database for a university health center. The health-center database mainly maintains patient and diagnosis information. It consists of six relational tables: Patient, Doctor, Drugs, Treatment, Dispense, and Specialties. These relational tables, which serve as the training examples for inducing the relational model schema, are represented in the Structured Query Language (i.e., the DDL of the relational data model). The detailed inputs to and output from Abstraction Induction Technique are listed in Appendix D. The induced relational model schema is graphically shown in Figure 4.18. The model construct "Construct_1" is in fact the model construct "Attribute" in relational data model terminology. The reference relational model schema, which conforms to the general definition of the relational data model, is shown in Appendix B and Figure 3.9.

[Figure 4.18: Relational Model Schema Induced from University Health Center Database. Diagram: Table is connected through consist_of_1 to Construct_1, Primary-Key, and Foreign-Key, with cardinalities induced from the health-center tables; relate_1, relate_2, and relate_3 are the induced association relationships. Attributes: Table.name (not-null, unique, single-valued); Construct_1.name (not-null, unique, single-valued); Construct_1.data-type (not-null, not-unique, single-valued); Construct_1.null-spec (not-null, not-unique, single-valued).]

The summary of this evaluation study is provided in Table 4.5. The induced relational model schema is consistent with the reference relational model schema in terms of model constructs and their relationships but contains inconsistencies in the remaining components of the induced model schema. It contains two incorrect cardinality specifications. For example, by the definition of the relational data model, a table contains one or more attributes. However, since all training examples happen to have two or more attributes (i.e., Construct_1), the minimal cardinality on the model construct "Construct_1" in relation to the model construct "Table" is over-specialized as 2. As mentioned at the end of the concept generalization phase in Section 4.2.2.4, this over-specialization results from the representativeness of the training examples employed in this evaluation study. Thus, when more representative training examples are provided, such incorrectness will not be present in the induced model schema. On the other hand, two attributes (i.e., IsUnique and DefaultValue) of the model construct "Attribute" in the reference model schema are missing in the induced model schema.
This may be due to the representativeness of training examples or lack of these specifications in the data model with which the training examples were defined. In the former case, the incompleteness of the induced model schema as compared to the reference one should be treated as an error, while it is not an error in the later case. All inconsistencies in implicit model constraints can be attributed to the incorrect cardinalities and the incomplete attributes of model constructs in the induced model schema. The induced model schema contains an excessive explicit model constraint (the domain constraint for DataType). Its absence in the reference relational model schema is because this explicit model constraint is system-dependent (varying from one DBMS to another) while the reference model schema is not. Thus, it should not considered as an error in the induced model schema. 176 Induced Relational Model Schema Model Construct Relationship between Model Constructs Cardinality Statistics Description/Causes 4 4 Perfect match. Perfect match. 18 — Incorrect: 2 Attribute of Model Construct 4 - Missing: 2 Explicit Model Constraint 4 - Excessive: 1 Implicit Model Constraint 17 - Incorrect: 2 - Missing: 3 minimal cardinality between Table and Construct_l and maximal cardinality on Construct l to Foreign-Key. Missing IsUnique and DefaultValue in Attribute model construct. Domain constraint for DataType is not shown in the reference relational model schema. Resulted from incorrect cardinalities. Resulted from missing attributes. Table 4.5: Summary of Evaluation Study 1 (Relational Model Schema) The precision rates of this evaluation study, if all of the specifications in a model schema are equally weighted, are shown in Table 4.6. As shown in the table, the worst precision rate of this evaluation study is 82.35% (the number of consistent specifications divided by the total number of specifications) when missing attributes are considered as erroneous induction results and inconsistent implicit model constraints are treated as independent errors from incorrect cardinalities and incomplete attributes. However, it is more reasonable to exclude inconsistent implicit model constraints from the computation of precision rate because inconsistent implicit model constraints can be attributed to and are derived from incorrect cardinalities and incomplete attributes of model constructs in the induced model schema. Thus, the precision rates of the evaluation study are 88.24% 177 (when missing attributes are treated as erroneous induction results) and 94.12% (when missing attributes are not erroneous induction results). According to these precision rates, it can be concluded that Abstraction Induction Technique produces a satisfactory relational model schema in the evaluation study 1. Implicit Model Constraints Are Included Implicit Model Constraints Are Not Included Missing Attributes Are Errors 82.35% 88.24% Missing Attributes Are Not Errors 90.20% 94.12% Table 4.6: Precision Rates of Abstraction Induction Technique in Evaluation Study 1 Evaluation Study 2: Inducing Network Model Schema From A Hypothetical Company Database Training examples in the second evaluation study are extracted from a hypothetical company database schema (in the network data model) depicted in [EN94]. This company database maintains uiformation concerning employees, their departments and supervisors, projects they are working on, and their dependents. The external representation of the training examples is CODASYL DBTF DDL-like language. 
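The same precision-rate computation (consistent specifications divided by total specifications) applies to every study reported in this section. As a rough sketch, the Python snippet below reproduces the study 1 rates quoted above; the per-category counts are reconstructed from Table 4.5 under the assumption that the excessive domain constraint is not counted as an error, and are therefore an illustration rather than figures taken verbatim from the dissertation.

def precision_rate(consistent, total):
    """Precision rate = number of consistent specifications / total specifications."""
    return sum(consistent) / sum(total)

# Study 1 with implicit model constraints excluded (reconstructed counts:
# 4 model constructs, 4 relationships, 18 cardinalities of which 2 are incorrect,
# 4 induced attributes with 2 missing from the reference, 4 explicit constraints).
missing_as_errors     = precision_rate([4, 4, 16, 2, 4], [4, 4, 18, 4, 4])
missing_not_as_errors = precision_rate([4, 4, 16, 4, 4], [4, 4, 18, 4, 4])
print(f"{missing_as_errors:.2%}, {missing_not_as_errors:.2%}")   # 88.24%, 94.12%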
The detailed inputs of and output from Abstraction Induction Technique are listed in Appendix E. The reference network model schema is graphically depicted in Figure 4.19, and the network model schema induced from hypothetical company database is shown in Figure 4.20. the training examples in the 178 Set Record name consist of (l.m) (l.m) name ^Ijata-typ?^^--" Attribute (I.m)\/ (0.1) Not-Duplicate rerer_by Record.name (not-null, unique, single-valued) SeLname(not-null, unique, single-valued) Attribute.name (not-null, unique, single-valued) Attribute.data-type (not-null, not-unique, single-valued) Figure 4.19: Reference Network Model Schema relate 2 (I.I) name (l.m) (3.m) Construct 1 Set (0.1) Not-Duplicate relate I Record.nanie (not-null, unique, single-valued) SeLname(not-nulI, unique, single-valued) Construct_I.name (not-null, unique, single-valued) Construct_l .data-type (not-null, not-unique, single-valued) Figure 4.20: Network Model Schema Induced from Hypothetical Company Database The model construct "Construct_l" in the induced network model schema corresponds to the model construct "Attribute" in the reference network model schema. In terms of model constructs, their attributes and relationships among model constructs, the induced network model schema is consistent with the reference network model schema. However, the induced network model schema consists of two incorrect cardinality specifications: the maximal cardinality on the model construct "Set" in relation to the model construct "Record" and the minimal cardinality on the model construct "Construct_r' in relation to the model construct "Record". As discussed previously, inconsistent implicit model constraints are derived errors; hence, they will not be 179 included in the evaluation summary of this study as shown in Table 4.7. Since the precision rate of this study is 91.30% (i.e., (4+3+10+4)-r(4+3+12+4)), Abstraction Induction Technique produces a satisfactory network model schema in this evaluation study. Induced Network Model Schema Model Construct Relationship between Model Constructs Cardinality Attribute of Model Construct Statistics Description/Causes 4 3 Perfect match. Perfect match. 12 - Incorrect: 2 4 minimal cardinality between Record and Construct_l and maximal cardinality on Set to Record. Perfect match. Table 4.7: Summary of Evaluation Study 2 (Network Model Schema) Evaluation Study 3: Inducing Hierarchical Model Schema From A Hypothetical Company Database The third evaluation study is based on the same hypothetical company database as used in the evaluation study 2. But, the application schema is represented in the hierarchical data model and expressed in a hierarchical data definitional language as shown in [EN94]. The detailed input of and output from Abstraction Induction Technique are listed in Appendix F. The reference hierarchical model schema is shown in Figure 4.21, while the hierarchical model schema induced from the hypothetical company database is shown in Figure 4.22. 180 Hierarchy (O.I) has root CT ""n't point_to Record refer_to (I.I) {I. I) ^ C0ini5t_0f (O.m) Pointer (O.m) ^^~name ^data-typc^ (O.m) (O.m) Attribute ( l . 
m ) ' 0 ' (0.1) Key (O.m) (0.1) Parent composed Hierarchy.name (not-null, unique, single-valued) Record.name (not-null, unique, single-valued) Setname(not-null, unique, single-valued) Attribute.name (not-null, unique, single-valued) Attribute.data-type (not-null, not-unique, single-valued) Figure 4.21: Reference Hierarchical Model Schema Hierarchy relate 1 relate 4 relate 3 Record •consist_of_l name Pointer C^ata-type^!^—' Construct 1 (1.1)0(0. 1) relate 2 Key Parent Hierarchy.name (not-null, unique, single-valued) Record.name (not-null, unique, single-valued) SeLname(not-null, unique, single-valued) Construct_I.name (not-null, unique, single-valued) Construct_I.data-type (not-null, not-unique, single-valued) Figure 4.22: Hierarchical Model Schema Induced from Hypothetical Company Database The model construct "Construct_l" in the induced hierarchical model schema corresponds to the model construct "Attribute" in the reference hierarchical model schema. In terms of model constructs, their attributes and relationships among model constructs, the induced hierarchical model schema is consistent with the reference 181 hierarchical model schema. However, the induced hierarchical model schema consists of five incorrect cardinality specifications as depicted in the evaluation summary of this study, as shown in Table 4.8 The satisfactory precision rate of this study (i.e., (6+5+19+4)-r(6+5+24+4) = 87.18%) indicates that Abstraction Induction Technique is effective in producing a hierarchical model schema in this evaluation study. Induced Hierarchical Model Schema Model Construct Relationship between Model Constructs Cardinality Statistics Description/Causes 6 5 Perfect match. Perfect match. 24 - Incorrect; 5 4 Attribute of Model Construct 1) Minimal cardinality on "Record" to "Hierarchy", 2) maximal cardinality on "Pointer" to "Record" via "relate_4", 3) maximal cardinality on "Pointer" to "Record" via "consist_of_l", 4) maximal cardinality on "Construct_l" to "Key", and 5) maximal cardinality on "Parent" to "Record" via "consist_of_r'. Perfect match. Table 4.8: Summary of Evaluation Study 3 (Hierarchical Model Schema) Summary of Evaluation Studies: The three evaluation studies are summarized in Table 4.9. As discussed above, the induced model schema in each evaluation study is consistent with its corresponding reference model schema in terms of model constructs, their attributes, and relationships 182 between model constructs. Cardinality specifications cannot accurately be induced in all evaluation studies. To improve the precision rate of Abstraction Induction Technique for inductive metamodeling, generalization of maximal and minimal cardinalities from training examples desires further research. Data Model to be Induced Study 1 Relational Study 2 Network Study 3 Hierarchical Application Schema # of Training Examples University Health Center Database Hypothetical Company Database Hypothetical Company Database 6 8 9 Precision Rate (excluding model constraints) 94.12% (only consisting of 2 incorrect cardinalities) 91.30% (only consisting of 2 incorrect cardinalities) 87.18% (only consisting of 5 incorrect cardinalities) Table 4.9: Summary of Three Evaluation Studies 183 CHAPTER 5 Construct Equivalence Assertion Language A construct equivalence representation is one of the essential components in the construct-equivalence-based methodology for schema translation and schema normalization. This chapter is devoted to the development of a construct equivalence representation. 
Construct Equivalence Assertion Language (CEAL). Design principles for the development of a construct equivalence representation are specified first, followed by the detailed CEAL syntactic specifications. The execution semantics of intra-model and inter-model construct equivalences specified in CEAL will also be defined. A construct equivalence transformation function, which is the building block of the construct equivalence transformation method, will be developed in this chapter. Finally, CEAL will be evaluated according to its design principles. 5.1 Design Principles for A Construct Equivalence Representation A construct equivalence representation serves as the formal specification for intra-model and inter-model construct equivalences required by schema translation and schema normalization. The design principles for a construct equivalence representation are defined as follows: 1. Declarative: 184 The construct equivalence representation must be declarative in nature. As mentioned in Section 2.4, the method for reasoning on intra-model and inter-model construct equivalences for schema translation and schema normalization is separated from the representation. 2. Supporting bi-directional construct equivalences: A bi-directional construct equivalence refers to the ability of interpreting and reasoning from both directions of the construct equivalence. As mentioned earlier, this property is important to the construct-equivalence-based schema translation in order to avoid the need for specifying two sets of unit-directional translation knowledge between two data models. It is even more important when considering the reusability of intra-model construct equivalences of a data model during defining translation knowledge between the data model and different data models and when considering the use of intra-model construct equivalences in schema normalization because the actual transformation direction is determined by normalization criteria which can be any combination of undesired model constructs. 3. Instantiable by applications constructs: Since construct equivalences are defined at the model level, they will be instantiated by application constructs in an application schema during the schema translation and schema normalization process. Therefore, the use of variables to be instantiated by application constructs in the construct equivalence representation are inevitable. 4. Capable of specifying various nimiber of constructs in a construct equivalence: Based on the number of constructs involved in each side of a construct equivalence, three types of construct equivalences can be identified: 1) one-to-one which involves 185 only one model construct on each side of a construct equivalence, 2) one-to-many (or many-to-one) which involves one model construct on one side of a construct equivalence and multiple model constructs on the other side, and 3) many-to-many. The construct equivalence representation needs to be capable of specifying these three types of construct equivalences. 5. Capable of expressing part of a model construct in a construct equivalence: A construct equivalence may involve model constructs which partially participate in the construct equivalence. A partial model construct in a construct equivalence denotes that not all but only some of its instances will participate in the construct equivalence. A partial model construct can be viewed as the model construct associated with certain restriction(s). 
For example, "single-valued attribute" is the attribute model construct with the restriction on the multiplicity property as "singlevalued." Thus, the construct equivalence representation needs to allow the expression of restrictions on model constructs when partial model constructs in construct equivalences are required. 6. Allowing the expression of multiple instances of the same model construct in a construct equivalence: Multiple mstances of the same model construct may be involved in a construct equivalence. For example, "a set of relations in the relational data model" refers to multiple instances of the relation model construct in the relational data model. Since the number of instances of a model construct is not static and will dynamically be determined at the schema translation or schema normalization process, the construct equivalence representation needs to be flexible enough to deal with this dynamics. 186 7. Capable of defining detailed correspondences for construct equivalences: Defining construct equivalences only at the construct level is not sufficient to describe the full spectrum of construct equivalences. As mentioned, each model construct in a model schema is described by some attributes (i.e., properties) and has relationships with other model constructs. Detailed correspondences on these two levels are required when specifying construct equivalences. For example, when specifying a non-foreign key attribute in the relational model which is equivalent to a single-valued attribute in the SOOER model, correspondences between properties (e.g., name, null specification, uniqueness specification, etc.) of the single-valued attribute in the SOOER model and the non-foreign key attribute in the relational data model need to be deliberated as well. Thus, the construct equivalence representation should allow specifying and being able to reason on detailed correspondences. 8. Single language for both intra-model and inter-model construct equivalences: The construct equivalence representation should serve as the specification language for both intra-model and inter-model construct equivalences. 5.2 Development of Construct Equivalence Assertion Language The section details the development of a construct equivalence representation. Construct Equivalence Assertion Language (CEAL). In this section, all examples illustrating the use of CEAL for specifying inter-model or intra-model construct equivalences are based on the SOOER model schema and the relational data model schema, shown in Figure 3.7 and Figure 3.9, respectively. 187 5.2.1 High-Level Syntax Structure of Construct Equivalence Assertion Language Construct Equivalence Assertion Language (CEAL) is used to express intra-model or inter-model construct equivalences each of which follows the syntax of: construct-set s constaict-set WITH construct-correspondence [ HAVING ancillary-description] where = refers to "is equivalent to" and [ ] denotes "optional". A construct equivalence defined in the above syntax is interpreted as "two construct-sets are equivalent with specific correspondences defined in the construct-correspondence clause." As discussed, a construct equivalence transformation method is a necessity for achieving bi-directional construct equivalences. The construct equivalence transformation method determines the direction of each construct equivalence and employs a construct equivalence transformation function for exchanging the LHS with the RHS construct-set of the construct equivalence. 
The construct equivalence transformation function needs to ensure the validity of transformed construct equivalences. Moreover, it must be an information preserving fimction; that is, the result of exchanging construct-sets on the construct equivalence which has already been exchanged should be the same as the original construct equivalence. However, as will be seen shortly, a LHS construct-set may be associated with some selection conditions, while a RHS construct-set cannot. Some but not all of the selection conditions on the LHS construct-set can be moved to the construct-correspondence clause when performing the construct-set exchange between two sides of a construct equivalence. 188 Ancillary-description is reserved for those selection-conditions which are initially associated with the LHS construct-set but cannot be specified in the constructcorrespondence clause after the construct-set exchange is performed. The ancillarydescription clause is optional since there may not exist any ancillary descriptions in a construct equivalence. 5.2.2 Definition of Constnict-Set The basic building blocks of a construct-set in a construct equivalence are constructinstance and construct-instance-set. A construct-set consists of a set of construct- instances and/or construct-instance-sets connected by AND operators. A construct- instance specified in the LHS of a construct equivalence provides a way to reference each instance of a construct-domain (i.e., an instance of a model construct in an application schema) satisfying certain selection conditions (called instance-selectioncondition). On the other hand, a construct-instance specified in the right-hand-side (RHS) of a construct equivalence refers to a particular instance or implies a new instance of a construct domain. A particular instance or a new instance of a construct-domain referred by a RHS construct instance can be specified in the construct-correspondence clause which will be discussed later. Thus, instance-selection-condition applicable to a LHS construct-instance is not allowed by a RHS construct-instance. Furthermore, since a LHS construct-instance is always qualified universally and a RHS construct-instance is always qualified existentially, both the universal qualifier (i.e., V) and the existential 189 qualifier (i.e., 3) are omitted in the construct equivalence assertion language. The general expression of a construct-instance is: construct-instance: constaict-domain [ (WHERE instance-selection-condition) ] WHERE clause is optional for a LHS construct-mstance but not allowed in a RHS construct-instance. On the other hand, a construct-instance-set refers to a set of instances of the same construct-domain. Thus, two levels of reference are required: an inner construct-instance for an instance in a construct-domain satisfying certain instance-selection-condition and an outer construct-instance-set for a set of such construct-instances from the same construct-domain satisfying certain set-selection-condition. Same as discussed above, a LHS construct-instance-set may be associated with an instance-selection-condition and/or a set-selection-condition, but none of them can be associated to a RHS. The general expression of a construct-instance-set is specified as: constaict-instance-set: {construct-instance: construct-domain [ (WHERE instance-selection-condition) ]} [ (WHERE set-selection-condition) ]} Both WHERE clauses are optional for a LHS construct-instance-set but not allowed in a RHS construct-instance-set. 
Reference to a construct-instance-set in the construct equivalence refers to all instances in the construct-instance-set. The reference protocol to the inner construct-instance of a construct-instance-set is defined as follows. When the iimer construct-instance is referenced, this reference intends to each instance in the construct-instance-set. When a set operator (i.e., n, u, or -) is applied on the irmer construct-instance, it returns the 190 result of applying the set-operator on all instances in the construct-instance-set. A set comparison operator on the inner construct-instance returns true if all pairs of instances in the construct-instance-set satisfy this set comparison operator; otherwise, it returns false. Because the reference order to the instances in the construct-instance-set is undefined, set operators and set comparison operators with the commutative property are applicable to the inner construct-instance. Thus, the set difference operator and the subset or superset comparison operators cannot be applied on the irmer constructinstance. Moreover, because set operators and set comparison operators are binary operations, the traditional convention of expressing set operators or set comparison operators (e.g., A r» B, A = B, etc.) is not appropriate to express a set operator or a set comparison operator on a single inner construct-instance. Thus, a new expression of a set operator or a set comparison operator is defined as <set-op> I <comp-op>(inner-construct-instance[.path.model-construct]) where | denotes "or", <set-op> is an applicable set operator (union and intersect are used for and n, respectively), and <comp-op> is a set comparison operator (equal and not_equal are used for = and ^). Constnict-Domain; A construct-domain can be a simple or complex construct-domain. A simple construct« domain refers to a model construct in a model schema and be either directly-referenced or path-referenced. A directly-referenced construct-domain addresses a model construct by its name, while a path-referenced construct-domain, similar to the path-expression of the constraint-specification language of the SOOER model defined in Section 3.2.2, 191 traverses a model schema via a path of relationships from a construct-instance to a model construct (called the terminal of the path) in the model schema. A simple constructdomain is formally expressed as: [model-name.]model-construct | construct-instance.path.model-construct The former is for a directly-referenced construct-domain, while the latter is for a pathreferenced construct domain. If a directly-referenced construct-domain appears in an inter-model construct equivalence, it is necessary to include a model name; otherwise, the model name can be omitted. However, a path-referenced construct domain need not be signified by a model name because its leading construct-instance can be used to trace on which model schema this path-referenced construct domain is defined. On the other hand, a complex construct-domain restricts a simple construct-domain by other simple construct-domains referring to the same model construct as the first simple construct-domain. Thus, a complex construct-domain is composed of a set of simple construct-domains connected by either set intersection or difference operators. union operator cannot be used since it is not a set-reduction operator. The All simple construct-domains in a complex construct-domain should refer to the same model construct. 
Since a complex construct-domain defines a selection condition at the model construct level, it can appear only in the LHS of an construct equivalence (i.e., a LHS construct-instance may be defined on a complex construct-domain, but a RHS constructinstance must be defined on a simple construct-domain). 192 1. R: Relational.Relation R is an construct-instance which denotes each relation in the relational data model. 2. A: Relational-Attribute - Relational.Foreign-Key.Attribute A denotes each non foreign key attribute (specified as Relational-Attribute Relational.Foreign-Key.Attribute) in the relational data model. 3. S: {E: SOOER.Entity-Class} S is a construct-instance-set which consist of a set of entity-classes (denoted by {E: SOOER.Entity-Class}) in the SOOER model. Instance-Selection-Condition of LHS Construct-Instances: Instance-selection-condition enclosed in the WHERE clause of a construct-instance further restricts the instances of the construct-domain to be referenced by the constructinstance. For example, "each single-valued attribute in the SOOER model" is specified as "A; SOOER.Attribute (WHERE A.Multiplicity = 'single-valued')." A; SOOER.Attribute refers to each instance of the attribute model construct in an application schema, regardless whether it is single- or multi-valued (i.e., A is a single- or multi-valued application construct which is an instance of the attribute model construct). The "WHERE A.Multiplicity = 'single-valued'" clause is an instance-selection-condition on A and as a result restricts A to instances of single-valued attributes. Instance- selection-condition defined for its preceding construct-instance is a conjunction of selection clauses each of which can be one of the following types: 1. Property selection: A property selection is used to select a subset of instances in a construct-domain whose property is equal to a particular value. The single-valued attribute example described above illustrates the meaning and use of a property selection. Formally, a property selection is expressed as: 193 cinst-property = value where cinst is the construct-instance on which the property selection is defined and property is a property of cinst. Path selection: A path selection, used to select a subset of instances in a constructdomain whose path(s) exhibits certain characteristics, can be expressed in one of the following forms: • • • cinsti .path.model-construct <set-comp> set1 cinsti .path.model-construct [ <set-op> set2 ... ] <set-comp> set3 | 0 or cInstI.path.model-construct <set-op> set2 ... <set-comp> set3 agg-functlon(clnst1.path.model-construct [ <set-op> set2 ... ]) <comp-op> agg-function(set3 [ <set-op> set4 ...]) j value where - cinsti is the construct-instance on which the path selection is defined. - setl refers to a set of instances. It can be a set of construct-instance(s) enclosed in { }, a construct-instance-set, a path-referenced model construct originating from a construct-instance possibly different from cinsti. - set2, set3 or set4 refers to a set of instances. It can be a construct-instance-set, a path-referenced model construct originating from a construct-instance possibly different from cinsti. - <set-comp> is a set comparison operator including =, 3,2, e, and c. - <set-op> is a set-reduction operator including n and -. - <comp-op> is a value comparison operator including =, >, >, <, and <. - agg-function() is an aggregation flmction on a set including count() and countdistinct(). 
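Before walking through these forms, a minimal sketch of how a property selection and a path selection of the second form might be evaluated against application constructs may help. The dictionary layout and the key structures of the sample relations are illustrative assumptions, not the prototype's representation of an application schema.

# Toy application constructs; field names are illustrative only.
attributes = [
    {"name": "SSN",   "Multiplicity": "single-valued"},
    {"name": "Phone", "Multiplicity": "multi-valued"},
]
# Property selection: "A: SOOER.Attribute (WHERE A.Multiplicity = 'single-valued')"
single_valued = [a for a in attributes if a["Multiplicity"] == "single-valued"]

# Path selection of the second form:
# "R: Relational.Relation (WHERE R.Primary-Key.Attribute ∩ R.Foreign-Key.Attribute = ∅)"
relations = [
    {"name": "Patient",  "primary_key_attrs": {"PatientID"}, "foreign_key_attrs": set()},
    {"name": "Dispense", "primary_key_attrs": {"DrugID", "PatientID"},
     "foreign_key_attrs": {"DrugID", "PatientID"}},
]
no_fk_in_pk = [r for r in relations
               if not (r["primary_key_attrs"] & r["foreign_key_attrs"])]
print([a["name"] for a in single_valued])   # ['SSN']
print([r["name"] for r in no_fk_in_pk])     # ['Patient']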
A path selection expressed in the first form specifies that an instance in the constructdomain will be included in the reference scope of the construct-instance (cinsti) if the specific set comparison (denoted by set-comp) on a path originating from the instance and a set (or the intersection or difference of multiple sets) originating from the same or different instance is satisfied. The second form performs intersection or difference operations on a path originating from an instance denoted by the construct- 194 instance with other set(s) of instances, and then compares the set operation result with another set of instances originating from the same or different construct (or an empty set). For example, that "each relation whose primary key attributes do not contain any of its foreign key attributes in the relational model" is represented as: "R: Relationai.Relation (WHERE R.Primary-Key.Attribute n R.Foreign-Key.Attribute = 0)". The last form compares the aggregate information derived from originating from a path an instance denoted by the construct-instance with another aggregate information or a value. For example, "each specialization relationship in the SOGER model which is associated with a single subclass" is specified as "S: SOOER.Specialization (WHERE count(S.subclass.Entity-Class) = 1". 3. Extension selection: An extension selection defines a restriction on the extension (i.e., actual data in database) of an application construct (i.e., instance denoted by the construct-instance or by a path originating from the construct-instance). Similar to those for path selections, three expression forms can be used for specifying extension selections: • • • ext(cinst1[.path.model-construct]) <set-comp> ext(set1 [<set-op> ext(set2)... ]) ext(cinst1[.path.model-construct] <set-op> set1 ...) <set-comp> ext(set2) | 0 ext-function(ext(cinst1[.path.model-construct] [ <set-op> set1 ...])) <comp-op> agg-functlon(ext(set2 [ <set-op> set3 ...])) 1 value where - ext(set) is the extension function. - ext-function( ) is an aggregation function on an extension including count( ), count-distinct(), has_duplicate(), and has_null(). For example, that "two relations in the relational model whose extensions on their primary key attributes are the same" is represented as: "T: Relationai.Relation AND 195 S: Relational.Relation (WHERE ext(S.Piimary-Key.Attribute) = ext(T.PrimaryKey.Attribute))". Example 1: Inter-model construct equivalence That "each relation in the relational data model whose primary key attributes do not contain any of its foreign key attributes is equivalent to an entity-class in the SOOER model...." is expressed as: R: Relational.Relation (WHERE R.Primary-Key.Attribute.Foreign-Key = 0) E: SOOER.Entity-Class WITH ... "R.Primary-Key.Attribute.Foreign-Key = 0" is a path selection of the second form. Example 2: Inter-model construct equivalence That "each non-foreign key attribute of a relation in the relational data model is equivalent to a single-valued attribute of the entity-class for the relation in the SOOER model. ..." is represented as: R: Relational.Relation AND N: R.Attribute - R.Foreign-Key.Attribute E: SOOER.Entity-Class AND A: E.Attribute WITH ... The construct-domain for the construct-instance N is a complex construct-domain. The LHS of the construct equivalence denotes each non-foreign key attribute N (i.e., R.Attribute - R. 
Foreign-Key.Attribute) of each relation R in the relational model, while 196 the RHS represents an entity-class E (i.e., SOOER.Entity-Class) and an attribute A of the entity-class (i.e., A: E.Attribute) in the SOGER model. The detailed correspondences related to the construct-instances defined in the RHS (e.g., "single-valued attribute" and "the entity-class for the relation") will be specified in the construct-correspondence clause. Example 3: Intra-model construct equivalence "A multi-valued attribute of an entity-class in the SOGER model is equivalent to an entity-class, a single-valued attribute, and an association relationship in the SOGER model. The association relationship connects the new entity-class and the entity-class to which the multi-valued attribute belongs. ..." E; SOOER.Entity-Class AND M: E.Attribute (WHERE M.Multiplicity = 'multi-valued') N: SOOER.Entity-Class AND A: Attribute AND R: SOOER.Assoclation WITH ... The LHS of the construct equivalence denotes "each multi-valued attribute M of every entity-class E in the SOGER model." The RHS of the construct equivalence involves three construct-instances and denotes "an attribute A, an entity-class N and an association relationship in the SOGER model." Again, the detailed correspondences for "single-valued attribute" and "the association relationship connecting the new entityclass and the entity-class to which the multi-valued attribute belongs" will be specified in the construct-correspondence clause. 197 Set-Selection-Condition on LHS Constnict-Instance-Sets: Set-selection-condition enclosed in the WHERE clause of a construct-instance-set identifies the maximal subset of construct-instances (from the construct-domain denoted by the inner construct-instance) to be included in the construct-instance-set or test the validity of the set of construct-instances currently included in the construct-instance-set. As mentioned, since set-selection-condition on a construct-instance-set serves as restriction criteria, it appears only in the LHS of a construct equivalence. A set- selection-condition consists of conjunctive selection clauses. If a selection clause on a construct-instance-set is defined upon the inner construct-instance (or a construct reachable from the inner construct-instance), it applies to every instance in the constructinstance-set according to the reference protocol defined above and is used to select from the construct-domain the maximal subset of construct-instances which satisfy the selection clause to be included in the construct-instance-set. If none of the possible subsets of the construct-instances in the construct-domain satisfies the selection clause, S will be an empty set. On the other hand, a selection clause defined on a constructinstance-set applies to the set of instances in the construct-instance-set. For example, assume S be a construct-instance-set. A selection clause "count(S) > 1" is defined on the construct-instance-set and requires the number of instances in S be greater than 1. If so, 8 contains all construct-instances currently included in S. If the number of instances in S is less than or equal to 1, the set of construct-instances currently included in S fails to satisfy the testing condition and S will become an empty set. 198 A selection clause in a set-selection-condition can be expressed in the following forms: 1. union \ intersection(lclnst.path.model-constaict) <set-comp> set1 [ <set-op> set2 ...] 2. 
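As an illustrative sketch (toy structures and names, not the prototype's code), populating a LHS construct-instance-set can be pictured in two stages: first filter the construct-domain by the inner construct-instance's selection condition, then apply the set-selection-condition to the resulting set as a whole, emptying it if the test fails.

# Sketch: populating a LHS construct-instance-set and applying a set-selection-condition.
entity_classes = [
    {"name": "Secretary", "identifier": ("SSN",)},
    {"name": "Engineer",  "identifier": ("SSN",)},
    {"name": "Project",   "identifier": ("ProjNo",)},
]
c_identifier = ("SSN",)   # identifier of the construct-instance C

# Inner construct-instance: every entity-class whose identifier equals C's identifier.
candidates = [e for e in entity_classes if e["identifier"] == c_identifier]

# Set-selection-condition applied to the set as a whole, e.g. "count(S) > 1".
S = candidates if len(candidates) > 1 else []
print([e["name"] for e in S])   # ['Secretary', 'Engineer']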
union | intersection(ext(icinst.path.model-construct)) <set-comp> ext(set1 [ <set-op> set2 ...]) 3. agg-functlon(clnstset) <comp-op> value where - icinst is the inner construct-instance of the construct-instance-set on which the selection clause is defined. - set1 or set2 refers to a set of instances. It can be a set of construct-instance(s) enclosed in { }, a construct-instance-set, a path-referenced model construct originating from a construct-instance possibly different from cinstl. - <set-comp> is a set comparison operator including =, z), 3, c, and c. - <comp-op> is a value comparison operator including =, >, >, <, and <. - agg-function(), an aggregation flmction, includes count() and count-distinct(). - icinstset is the construct-instance-set on which the selection clause is defined. Example 4: Intra-model construct equivalence "In the SOGER model, a set of entity-classes having the same identifier as another entity-class whose extension on its identifier attributes is a superset of the union of that of the formers are equivalent to a new specialization relationship in which the latter is its superclass and the formers are its subclasses...." is expressed as: C: SOOER.Entity-Class AND T: {E: SOOER.Entity-Class (WHERE E.Identifier.Attribute = C.ldentlfier.Attribute)} (WHERE unlon(ext(E.ldentlfier.Attribute)) e ext(C.Identifier.Attribure)) S: SOOER.Speciailzatlon WITH ... "E: SOOER.Entity-Class (WHERE E.Identifier.Attribute = C.Identifier.Attribute)" is the iimer construct-instance of the construct-instance-set T and denotes every entity-class whose identifier attributes are the same as those of the entity-class C. The set-selection- 199 condition on T (i.e., "union(ext(E.Identifier.Attribute)) c ext(C.Identifier.Attribure)") further restricts the set of entity-classes to be included in T if the union of their extensions on identifier attributes is the same as or a subset of the extension on identifier attributes of C. The RHS of this intra-model construct equivalence denotes a specialization relationship. The detailed correspondence for "C is the superclass of T in the specialization relationship" will be specified in the construct-correspondence clause. 5.2.3 Definition of Construct-Correspondence So far, a construct equivalence is defined at the construct-set level. That is, a set of model constructs defined in the LHS of a construct equivalence are equivalent to another set of model constructs defined in the RHS. The construct-correspondence clause is employed to specify correspondences within or between construct-sets of a construct equivalence. The construct-correspondence of a construct equivalence consists of a set of correspondences each of which can be either a connection correspondence, property correspondence or relationship correspondence. For example, in Example 1, besides declaring each relation whose primary key attributes do not contain any of its foreign key attributes in the relational model being equivalent to an entity-class in the SOOER model, a correspondence at the property level (i.e., property correspondence) is needed. It is that the name of the entity-class is the same as that of the relation. Moreover, as depicted in Example 2, correspondences for "single-valued attribute A" and "the entityclass E for the relation R" (where A and E are the RHS construct-instances and R is the LHS construct-instance) need to be specified in the construct-correspondence of the 200 construct equivalence. 
The former is an example of property correspondence while the latter is an example of connection correspondence. That "the association relationship R connects the new entity-class N and the entity-class E to which the multi-valued attribute belongs" (where R and N are the RHS construct-instances and E is the LHS constructinstance), described in Example 3, demonstrates a relationship correspondence. Connection Correspondence: A connection correspondence specifies that a construct-instance in the RHS of a construct equivalence is corresponding to another construct-instance (or a pathreferenced model construct originating from some construct-instance) defined in the LHS of the construct equivalence. "The entity-class E for the relation R" in Example 2 indicates that E must be the entity-class corresponding to the relation R and thus is a connection correspondence. Moreover, connection correspondences can only be used in inter-model construct equivalences. The expression of an equivalence connection is formally specified as: cinsti = cinst2[.path.model-construct] where cinsti is a RHS construct-instance and cinst2 is a LHS construct-instance of an inter-model construct equivalence. Property Correspondence: A property correspondence which may appear in an inter-model or intra-model construct equivalence asserts a property of a construct-instance defined in the RHS of the construct equivalence equal to a constant value or equal to a property of another construct-instance defined in the LHS of the construct equivalence. A property correspondence may be 201 conditional; that is, the property correspondence can be established only when a condition is satisfied. A property correspondence is formally expressed as: cinsti .property = value | cinst2.property [ IF condition ] where cinsti is a construct-instance defined in the RHS of the construct equivalence. cinst2 is a construct-instance defined in the LHS of the construct equivalence. The expression of a condition is the same as those defined in property, path or extension selection. Example 5: (continued fi-om Example 2) "Each non-foreign key attribute of a relation in the relational model is equivalent to a single-valued attribute of the entity-class for the relation in the SOOER model. The name (or data-type) of the single-valued attribute is the same as that of the non-foreign key attribute. The null-specification of the single-valued attribute is determined by whether the non-foreign key attribute allows null values or not. The value of the uniqueness property of the single-valued attribute is the same as that of the IsUnique property of the non-foreign attribute." This inter-model construct equivalence is represented as: R: Relational.Relation AND N: R.Attribute - R.Foreign-Key.Attribute E: SOOER.Entity-Class AND A: E.Attribute WITH E = R AND A.Multiplicity = 'single-valued' AND A.Name = N.Name AND ANull-Spec = N.lsNull AND A.Uniqueness = N.IsUnique AND ADataType = N.DataType The construct-correspondence enclosed in the WITH clause of this inter-model construct equivalence specifies a connection correspondence between two construct-instances (i.e., 202 E = R which denotes the entity-class E is corresponding to the relation R), a property correspondence for single-valued attribute A (i.e., A.Multiplicity = 'single-valued') and several property correspondences between single-valued attribute A and the non-foreign key attribute N. 
Relationship Correspondence; A relationship correspondence specifies a change to the relationships of a constructinstance as a result of participating in a construct equivalence. That is, it defines how the content of a construct reachable from a construct-instance will be changed when the construct-instance is involved in the construct equivalence. Relationship correspondences can be used in both inter-model and intra-model construct equivalence. A relationship correspondence can be defined for a RHS or a LHS construct-instance. Identical to a property correspondence, a relationship correspondence can be conditional. A relationship correspondence is formally expressed as: cinstl.path.model-construct = set1 [ <set-op> set2 ... ] [ IF condition ] where cinsti is a construct-instance defined in the construct equivalence. set1 or set2 refers to a set of instances. It can be a set of constructinstance(s) enclosed within { }, a construct-instance-set, a path-referenced model construct originating from a construct-instance possibly different from cinsti. The expression of condition is the same as that of property, path, or extension selection. Example 6: (continued from Example 3) "A multi-valued attribute in the SOGER model is equivalent to an association relationship, an entity-class and a single-valued attribute in the SOGER model. The 203 association relationship connects the new entity-class and the entity-class to which the multi-valued attribute belongs. The single-valued attribute becomes the only attribute as well as the identifier of the new entity-class. As the identifier of the new entity-class, the uniqueness and niUl-spec properties of the single-valued attribute are 'unique' and 'notnull', respectively. If different instances of the existing entity-class can share the same value of the multi-valued attribute (i.e., the uniqueness property of multi-valued attribute is 'not-unique'), the maximal cardinality on the existing entity-class is m; otherwise, it is 1. The minimal cardinality on the existing entity-class is always I since each value of the multi-valued attribute was associated to an instance of the existing entity-class. ..." This intra-model construct-equivalence is represented as: E; Entity-Class AND M: E.Attribute (WHERE M.Multiplicity = 'multi-valued') N: Entity-Class AND A: Attribute AND R: Association WITH A.Multiplicity = 'single-valued' AND R.relate.Entity-Class = {E, N} AND N.Attribute = {A} AND N.Identifier.Attribute = N.Attribute AND AUniqueness = 'unique' AND A.Null-Spec = 'not-nuir AND R.relate.E(Max-Card) = m IF M.Uniqueness = 'not-unique' AND R.relate.E(Max-Card) = 1 IF M.Uniqueness = 'unique' AND R.relate.E(Min-Card) = 1 AND The first, fifth and sixth correspondences (i.e., A.Multiplicity = 'single-valued', A.Uniqueness = 'unique' and A.Null-Spec = 'not-null') are property correspondences defined on the properties of the attribute A and state that the attribute A is single-valued, must be unique and cannot allow null values, respectively. The second, third and fourth 204 correspondences (i.e., R.reIate.Entity-Class = {E, N}, N.Attribute = {A} and N.Identifier.Attribute = N.Attribute) are relationship correspondences which indicate that "the association relationship R must link the entity-classes E and N", that "the singlevalued attribute becomes the only attribute of the new entity-class," and that "the attribute becomes the identifier of the new entity-class," respectively. 
The last three correspondences define values for the maximal and minimal cardinality of the link between the entity-class E and the association relationship R. "R.relate.E(Max-Card) = m IF M.Uniqueness = 'not-unique'" is a conditional property correspondence. It states that the maximal cardinality between the entity-class E and the association relationship R is m (for many) if the Uniqueness property of the multi-valued attribute M is 'not-unique'. Otherwise, the maximal cardinality between the entity-class E and the association relationship R is 1 (as defined in R.relate.E(Max-Card) = 1 IF M.Uniqueness = 'unique').

5.2.4 Definition of Ancillary-Description

The ancillary-description clause exists in a construct equivalence to ensure that the construct equivalence is bi-directional. The ancillary-description consists of a conjunction of ancillary clauses. An ancillary clause is used to hold a selection clause of an instance-selection-condition or set-selection-condition on a LHS construct-instance of a construct equivalence which cannot be specified in the construct-correspondence clause after exchanging construct-sets between the two sides of the construct equivalence. In addition, there are some other sources for the ancillary-description when exchanging construct-sets of a construct equivalence. Table 5.1 provides a summary of all potential sources of the ancillary-description.

    Original Construct Equivalence                        | Ancillary-Description of Transformed Construct Equivalence
    Instance-selection-condition on a LHS construct-instance:
      - Property selection clause                         | no
      - Path selection clause                             | partial
      - Extension selection clause                        | yes
    Set-selection-condition on a LHS construct-instance-set | yes
    Complex construct-domain of a LHS construct-instance  | yes
    Conditional construct-correspondence                  | partial

    Table 5.1: Sources for Ancillary-Description When Exchanging Construct-Sets

A property selection on a LHS construct-instance, which defines a value that must be satisfied by a property of each instance denoted by the construct-instance, will become a property correspondence that ascertains the property value of an instance denoted by the construct-instance (formerly in the LHS) after exchanging the construct-sets of the construct equivalence. Similarly, the first form of a path selection clause with the set equality comparison operator will become a relationship correspondence in the construct-correspondence clause of the transformed construct equivalence. Since these two types of selection clauses on LHS construct-instances can be transformed into property or relationship correspondences in the construct-correspondence of the transformed construct equivalence, they will not be specified in the ancillary-description clause of the transformed construct equivalence. However, there exists no counterpart in the construct-correspondence clause for the first form of path selections involving non-equality set comparison operators, the second and third forms of path selections, extension selections, or set-selection-conditions. These need to be described by ancillary clauses in the ancillary-description of the transformed construct equivalence. In addition, as mentioned in the previous subsection, a complex construct-domain can appear only in the LHS of a construct equivalence. When exchanging the construct-sets of the construct equivalence, all complex construct-domains originally defined for LHS construct-instances are not permitted for any of the RHS construct-instances of the transformed construct equivalence.
As a result, complex construct-domains need to be described in the ancillary-description of the transformed construct equivalence. Moreover, some of conditional construct-correspondences in a construct equivalence cannot be transformed and specified in the construct-correspondence of the transformed construct equivalence. Thus, they represent another source for the ancillary-description when exchanging construct-sets of a construct equivalence. The detailed discussion of the construct equivalence transformation function (i.e., exchanging construct-sets of a construct equivalence) will be deferred until Section 5.3. The definition of ancillary-description concludes the syntactic specifications of CEAL. To demonstrate the expressiveness of CEAL, the inter-model construct equivalences between the EER and SOOER models (whose model schemas are shown in Figure 3.9 and Figure 3.7 respectively) and some of the intra-model construct equivalences of the SOOER model are engineered and listed in Appendix G and Appendix H. 207 5.2.5 Execution Semantics of Construct Equivalences Once the syntax of CEAL is defined, the execution semantics of inter-model and intramodel construct equivalences expressed in CEAL need to be explicitly specified. As shown in Figure 2.8, inter-model construct equivalences are used in the source-target projection stage of the construct-equivalence-based schema translation. Thus, LHSs of inter-model construct equivalences represent instantiations from application constructs in the source application schema, while RHSs of inter-model construct equivalences may involve mstantiations of application constructs in the target application schema and result in the creation of corresponding application constructs in the target application schema. The detailed execution semantics of inter-model construct equivalences is depicted as follows: 1. Instantiation of LHS construct-instances from the source application schema: A LHS construct-instance refers to each application construct which belongs to the construct-domain (directly-referenced, path-referenced, or complex) of the constructinstance and satisfies the instance-selection-condition of the construct-instance. A LHS construct-instance-set refers to the maximal subset of the application constructs (referenced by its inner construct-instance) which satisfies the set-selection-condition of the construct-instance-set. 2. Instantiation of RHS construct-instances from the target application schema: A RHS construct-instance associated with a connection correspondence in the construct-correspondence clause of the construct equivalence results in searching in 208 the target application schema for its corresponding application construct (as indicated in the RHS of the connection correspondence). If the corresponding application construct has not yet existed in the target application schema, execution of this construct equivalence will be deferred. 3. Application construct creation in the target application schema: If a RHS construct-instance is not associated with a correspondence connection in the construct-correspondence clause of the construct equivalence, an application construct denoted by this construct-instance will be created in the target application schema when the construct equivalence is executed. 
If the RHS construct-instance is defined on a path-referenced construct-domain, a linkage between the application construct denoted by the origin of the path and the newly created application construct for the RHS construct instance will automatically be established. 4. Assignment implied by property and relationship correspondences: A property (or relationship) correspondence indicates an assignment operation by which the value represented in the RHS of the correspondence is assigned to the property of the application instance (or the terminal of the path) denoted by the LHS of the correspondence. If a property or relationship correspondence involves an inner construct-instance which is not the operand of a set operator, the property or relationship will be executed for every application construct included in the construct-instance-set to which the inner construct-instance belongs. 5. No execution effect from ancillary-description: The ancillary-description of a construct equivalence does not have any execution semantics. That is, when a construct equivalence is executed, its ancillary- 209 description will not cause any change in the source or target application schema. The ancillary-description may be listed as additional constraints on the target application schema. During the schema translation process, intra-model construct equivalences of the source data model are employed in the source convergence stage and those of the target data model are employed in the target enhancement stage, as shown in Figure 2.8. On the other hand, intra-model construct equivalences can be used in the schema normalization process. In all cases, the execution of an intra-model construct equivalence involves instantiating LHS and possibly RHS construct-instances from application schema, creating new application constructs in the application schema, and possibly deleting existing application constructs from the application schema. The detailed execution semantics of intra-model construct equivalences is depicted as follows: 1. Instantiation of LHS construct-instances from the application schema: Same as that defined in the execution semantics of inter-model construct equivalence. 2. Instantiation of RHS construct-instances from the application schema: If the RHS of a selection-condition on a LHS construct-instance refers to a RHS construct-instance, an instantiation of the RHS construct-instance will be triggered. Since intra-model construct equivalences do not involve any connection correspondence, no execution delay will occur. 3. Application construct creation in the application schema; If a RHS construct-instance (or constnict-instance-set) is not referenced by the RHS of any path selection on any LHS construct-instance, an application construct 210 denoted by this construct-instance (or construct-instance-set) will be created in the application schema when the construct equivalence is executed. If the RHS construct-instance is defined on a path-referenced construct-domain, a linkage between the application construct denoted by the origin of the path and the newly created application construct for the RHS construct-instance will automatically be established. 4. 
Application construct deletion in the application schema: If a LHS construct-instance (or construct-instance-set) does not appear in the LHS of any property or relationship correspondence or is not enclosed within { } in the RHS of any relationship correspondence, the application construct denoted by the construct-instance (or construct-instance-set) will be removed from the application schema after executing the construct equivalence.

5. Assignment implied by property and relationship correspondences: Same as that defined in the execution semantics of inter-model construct equivalences.

6. No execution effect from ancillary-description: Same as that defined in the execution semantics of inter-model construct equivalences.

The execution semantics of inter-model and intra-model construct equivalences is summarized in Table 5.2. Two major differences in execution semantics between inter-model and intra-model construct equivalences can be identified. One difference concerns the instantiation of RHS construct-instances and the application construct creation, and results from the use of connection correspondences only in inter-model construct equivalences but not in intra-model construct equivalences. The other difference is the application construct deletion. Since inter-model construct equivalences are used in the source-target projection stage of the schema translation, the main objective is to create a target application schema corresponding to the source application schema. Thus, the deletion of application constructs from either the source application schema or the target application schema is not required. On the other hand, since intra-model construct equivalences are used in the source convergence or target enhancement stage of the schema translation or in the schema normalization process, transforming the current application schema into a semantically equivalent one may involve the deletion of application constructs in the current application schema.

                                              | Inter-model Construct Equivalence                    | Intra-model Construct Equivalence
    Instantiation of LHS construct-instances  | From the source application schema                   | From the current application schema
    Instantiation of RHS construct-instances  | Triggered by connection correspondences              | Triggered by references in LHS construct-instances
    Application construct creation            | When a RHS construct-instance is not associated with a connection correspondence | When a RHS construct-instance is not referenced by the RHS of any path selection on any LHS construct-instance
    Application construct deletion            | Not applicable                                       | When a LHS construct-instance does not appear in the LHS of any correspondence or is not enclosed within { } in the RHS of any relationship correspondence
    Property and relationship correspondence  | Resulting in an assignment operation                 | Resulting in an assignment operation
    Ancillary-description                     | No execution semantics                               | No execution semantics

    Table 5.2: Summary of Execution Semantics of Construct Equivalence

5.3 Information Preserving Construct Equivalence Transformation Function

To allow a construct equivalence to be bi-directional, a construct equivalence transformation function (employed by the construct equivalence transformation method) for exchanging construct-sets between the two sides of a construct equivalence and restructuring the other components of the construct equivalence is necessary. As discussed, it needs to ensure the validity of transformed construct equivalences and must be an information preserving function.
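Before walking through the individual restructuring steps, it may help to picture a construct equivalence as a simple container whose two construct-sets can be swapped. The Python sketch below is a hypothetical representation (the class and field names are assumptions, not part of CEAL); the restructuring of correspondences, construct-domains, selection conditions, and ancillary clauses described in the following paragraphs is only stubbed out, so only the final exchange step is real here.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ConstructEquivalence:
        lhs: List[str]           # LHS construct-set: construct-instances / construct-instance-sets
        rhs: List[str]           # RHS construct-set
        with_clause: List[str]   # construct-correspondences (connection / property / relationship)
        having: List[str] = field(default_factory=list)   # ancillary-description

    def transform(eq: ConstructEquivalence) -> ConstructEquivalence:
        """Exchange the two construct-sets of a construct equivalence.

        A full implementation would first apply the restructuring operations
        of Tables 5.3 and 5.4 to the ancillary clauses, correspondences,
        complex construct-domains, and selection conditions; those steps are
        left as placeholders in this sketch.
        """
        restructured_with = list(eq.with_clause)    # placeholder for restructuring steps
        restructured_having = list(eq.having)       # placeholder for restructuring steps
        return ConstructEquivalence(lhs=eq.rhs, rhs=eq.lhs,
                                    with_clause=restructured_with,
                                    having=restructured_having)

Because the restructuring is stubbed, transform(transform(eq)) trivially returns the original equivalence here; the dissertation's contribution is showing that the same round-trip property holds once the full restructuring operations are applied.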
213 The tasks and steps of transforming a construct equivalence consist of 1) restructuring all ancillary clauses in the ancillary-description clause, 2) restructuring all correspondences (connection, property, and relationship correspondences) in the construct-correspondence clause, 3) restructuring all complex construct-domains of LHS construct-instances, 4) restructuring all selection clauses of the instance-selection-condition on each LHS construct-instance, 5) restructuring all selection clauses of the set-selection-condition on each LHS construct-instance-set, and 6) exchanging the LHS construct-set with the RHS construct-set. Each step employs a set of restructuring operations. The restructuring operations for steps 1 to 2 are listed in Table 5.3, and those for steps 3, 4 and 5 are listed in Table 5.4. The operation required by the step 6 is simply an exchange operation between two sides of a construct equivalence. 214 Restructuring Operation Original Construct Equivalence Ancillary clause (N) - B[uC...]n{V}=0 DPI: Append B [ -C ...]" into the constructdomain of the RHS construct-instance V and remove N from the ancillary-description. - B [ o C . . . ] n { V } = {V} OP2: Append "n B [ n C ...]" into the constructdomain of the RHS construct-instance V remove N from the ancillary-description. OP3: Exchange the antecedent of N with the condition - Antecedent IF condition of N. If the new antecedent of C is a property correspondence or the path selection of the first form with a set equality comparison operator, C is converted into correspondence in the construct-correspondence. OP4: N is converted into a selection clause of the - Others construct-instance (or construct-instance-set) denoted by the first construct-instance (or construct-instanceset) in N. Construct-Correspondence (C) OPS: Exchange the LHS of C with the RHS of C. - Connection correspondence - Unconditional property correspondence => S.property = value OP6: C is converted into a property selection of the RHS construct-instance S. => S.property = T.property OPT: Exchange the LHS of C with the RHS of C. - Unconditional relationship correspondence S.path.M = P OPS: If P is a path-referenced model construct and the origin construct-instance of P is associated with a connection correspondence, exchange the LHS of C with the RHS of C. C is converted into a path selection clause of the construct-instance denoted by the origin construct-instance of P. => S.path.M = P <set-op> Q ... OP9: If <set-op> is change it into 'w'. If <set-op> is 'vj change it into OPIO: Exchange the antecedent of C with the - Conditional property condition of C. If the new antecedent of C is neither a correspondence property correspondence nor the path selection of the first form with a set equality comparison operator, C is converted into an ancillary clause in the ancillarydescription See OPIO - Conditional relationship correspondence Table 5.3: Restructuring Operations for Ancillary Clause and Construct-Correspondence 215 Original Construct Equivalence LHS construct-instance (V) with complex construct-domain - V:A-B[-C...] - V:AnB[nC...] Restructuring Operation CPU: Add an ancillary clause (B [ u C ...] n {V} = 0) in the ancillary-description of the construct equivalence and remove B [ - C ...]" from the construct-domain of V. OP12: Add an ancillary clause (B [ o C ...] n {V} = {V}) in the ancillary-description of the construct equivalence and remove "n B [ n C ...]" from the construct-domain of V. 
Selection clause (C) of the instanceselection-condition on a LHS construct-instance - Property selection clause OP13: C is converted into a property correspondence. - Path selection clause => First form with set equality OP14: If P is a path-referenced model construct and comparison operator S is associated with a connection correspondence, exchange the LHS of C with its RHS (i.e., C now (i.e., S.path.M = P) becomes P = S.path.M). C is converted into a relationship correspondence. OP15: C is converted into an ancillary-clause. First form with non-equality comparison operator, second form, and third fomi See OP15 - Extension selection clause See OP15 Selection clause (C) of the setselection-condition on a LHS construct-instance-set Table 5.4: Restructuring Operations for Complex Construct-Domains and Selection Clauses on LHS Construct-Instances (or Construct-Instance-Sets) Example 7: (continued from Example 5) As shown in Example 5, the mter-model construct equivalence which is specified as "Each non-foreign key attribute of a relation in the relational model is equivalent to a single-valued attribute of the entity-class for the relation in the SOGER model. ..." is represented as: 216 R: Relational.Relatlon AND N: R.Attribute - R.Foreign-Key.Attribute E: SOOER.Entity-Class AND A: E.Attribute WITH ERRAND A.Multipliclty = 'single-valued' AND A.Name = N.Name AND A.Null-Spec = N.lsNull AND A.Uniqueness = N.lsUnique AND A.DataType = N.DataType The transformation process of this inter-model construct equivalence is depicted below and graphically summarized in Figure 5.1. R: Relational.Relatlon AND N; R.Attribute - R.Foreign-Key.Attribute E: SOOER.Entity-Class AND A: E.Attribute E = R AND A.Multiplicity = 'sinqle-valued' AND A.Name = N.Name AND A.Null-Spec = N.lsNull AND A.Uniqueness = N.lsUnique AND A.DataTvpe = N.DataType <- OPll (step 3) <-OP5 <-OP6 <-OP7 <-OP7 <-OP7 <-OP7 (step 2) (step 2) (step 2) (step 2) (step 2) (step 2) Figure 5.1: Process of Transforming Construct Equivalence of Example 5 1) Restructuring all ancillary clauses in the ancillary-description clause: Since the original construct equivalence does not contain any ancillary clause, no action is performed in this step. 2) Restructuring all correspondences in the construct-correspondence clause: According to the restructuring operation 0P5, the LHS of the connection correspondence E = R will be exchanged with its RHS (i.e., the connection 217 correspondence will become R = E). Since the RHS of the property correspondence A.Multiplicity = 'single-valued' is a constant value, 0P6 is executed. Consequently, the property correspondence becomes a property selection clause of the constructinstance A. According to 0P7, the next property correspondence clause A.Name = N.Name is restructured and becomes N.Name = A.Name. The same restructuring operation is applied to the remaining property correspondence clauses since their format is the same as A.Name = N.Name. 3) Restructuring complex construct-domains of LHS construct-instances: The LHS construct-instance N is defined on the complex construct-domain R.Attribute R.Foreign-Key.Attribute. Therefore, according to OP11, the ancillary clause R.Foreign-Key.Attribute n {N} = 0 will be added into the ancillary-description. 4) Restructuring instance-selection-conditions on LHS construct-instances: Since the construct equivalence has no instance-selection-condition, no action is taken in this step. 
5) Restructuring set-selection-conditions on LHS construct-instance-sets: Since the LHS construct-set of the construct equivalence does not involve any construct-instanceset, no action is taken in this step. This inter-model construct equivalence after this step is as follows: R: Reiational.Relation AND N: R.Attribute E: SOOER.Entity-Class AND A: E.Attribute (WHERE A.Multiplicity = 'single-valued') WITH RhEAND N.Name = A.Name AND N.Null-Spec = AlsNull AND N.Uniqueness = A.lsUnique AND N.DataType = ADataType 218 HAVING R.Foreign-Key.Attribute r» {N} = 0 6) Exchanging the LHS construct-set with the RHS construct-set; The LHS constructinstances R and N are exchanged with the RHS construct-instances E and A. The resulted construct equivalence is shown in Figure 5.2. E: SOOER-Entlty-Class AND A: E.Attribute (VWERE A.Multiplicity = 'single-valued') R: Relational.Relation AND N; R.Attribute WITH R = EAND N.Name = A.Name AND N.Null-Spec = A.lsNull AND N.Uniqueness = A.lsUnique AND N.DataType = A.DataType HAVING R.Foreign-Key.Attribute n {N} = 0 Figure 5.2: Construct Equivalence Transformed from Example 5 The transformed inter-model construct equivalence can be interpreted as: each singlevalued attribute of an entity-class in the SOOER model is equivalent to an attribute of the relation for the entity-class in the relational model. The attribute in the relation is a nonforeign-key attribute (as specified in the ancillary clause R.Foreign-Key.Attribute o {N} = 0). The name (or data-type) of the non-foreign key attribute is the same as that of the single-valued attribute. Whether the non-foreign key attribute allows null values or not is determined by the null-specification of the single-valued attribute. The IsUnique property of the non-foreign attribute corresponds to the uniqueness property of the single-valued attribute." The meaning of the transformed inter-model construct equivalence conforms to the reverse direction of its original construct equivalence. 219 Example 8: (continued from Example 6) The intra-model construct equivalence within the SOGER model shown in Example 6 translates a multi-valued attribute into a new entity-class, a single-valued attribute, and an association relationship connecting the new entity-class and the entity-class to which the multi-valued attribute belongs. The complete intra-model construct equivalence is represented as below. E; Entity-Class AND M: E.Attribute (WHERE M.Multiplicity = 'multi-valued') N; Entity-Class AND A: Attribute AND R: Association WITH A.Multiplicity = 'single-valued' AND R.relate.Entity-Class = {E, N} AND N.Attribute = {A} AND N.Identifier.Attribute = N.Attribute AND A.Name = M.Name AND A.DataType = M.DataType AND A.Uniqueness = 'unique' AND A.Null-Spec = 'not-null' AND R.relate.E(Max-Card) = m IF M.Uniqueness = 'not-unique' AND R.relate.E(Max-Card) = 1 IF M.Uniqueness = 'unique' AND R.relate.E(Min-Card) = 1 AND R.relate.N(Max-Card) = m AND R.relate.N(Min-Card) = 0 IF M.Null-Spec = 'null' AND R.relate.N(Min-Card) = 1 IF M.Null-Spec = 'not-null' The transformation process of this inter-model construct equivalence is graphically depicted in Figure 5.3. The resulted construct equivalence is shown in Figure 5.4. 220 E: Entity-Class AND M: E.Attribute (WHERE M.Multiplicity = 'multi-valued') N: Entity-Class AND A: Attribute AND R: Association WITH A.Multiplicity = 'single-valued' AND R.relate.Entity-Class = (E. 
N} AND N.Attribute=(A>AND N.Identifier.Attribute = N.Attribute AND A.Name = M.Name AND A.DataTvpe = M.DataTvpe AND A.Uniqueness = 'unique' AND A.Null-Spec = 'not-null' AND R.relate.EfMax-Card^ = m IF M.Uniqueness = 'not-unique' AND <- OP13 (step 4) OP6 ^ OPS OPS <- OPS <- OPT <- OP7 <- OP6 <r- OP6 (step 2) (step 2) (step 2) (step 2) (step 2) (step 2) (step 2) (step 2) 4-OPlO (step 2) R.rel9teE(M9x-C9rci) = 1 IF M.Uniqueness = 'unique' AND R.relate.E(Min-Card) = 1 AND R.relate.N(Max-Card^ = m AND R.relate.N(Min-Card^ = 0 IFM.Null-Spec = 'null' AND R.relate.NrMin-Card) = 1 IF M.Null-Spec = 'not-null' •<-OP10 (step 2) <— OP6 (step 2) OP6 (step 2) <-OP10 (step 2) <r- OPIO (step 2) Figure 5.3: Process of Transforming Construct Equivalence of Example 6 221 N: Entity-Class A: Attribute R: Association (WHERE N.Attribute = {A} AND N.Attribute = N.ldentlfier.Attribute) AND (WHERE A.Multipllclty = 'single-valued' AND A.Uniqueness = 'unique' AND A.Null-Spec = 'not-null') AND (WHERE R.relate.Entity-Class = {E, N} AND R.relate.E(Min-Card) = 1 AND R.relate.N(Max-Card) = m) E: Entity-Class AND M: E.Attribute WITH M.Name = A.Name AND M.DataType = A.DataType AND M.Uniqueness = 'not-unique' IF R.relate.E(Max-Card) = m AND M.Uniqueness = 'unique' IF R.relate.E(Max-Card) = 1AND M.Null-Spec = 'null' IF R.relate.N(Min-Card) = 0 AND M.Null-Spec = 'not-null' IF R.relate.N(Min-Card) = 1 AND M.Multiplicity = 'multi-valued' Figure 5.4: Construct Equivalence Transformed from Example 6 This transformed inter-model construct equivalence shown in Figiure 5.4 can be interpreted as; in the SOGER model, each entity-class N having a 'single-valued', 'unique' and 'not-null' attribute as its only attribute as well as its identifier, and an association relationship R connecting N (with the maximal cardinality as m) to another entity-class E (with the minimal cardinality as 1) are equivalent to a multi-value attribute of the entity-class E. The name and the data type of the multi-valued attribute are the same as those of the single-valued attribute. If the maximal cardinality of the link between the association relationship R and the entity-class E is m (i.e., an instance of the entity-class N can be associated with multiple instances of the entity-class E), the Uniqueness property of the multi-valued attribute is 'not-unique'; otherwise, it is 222 'unique'. If the mininaai cardinality of the link between the association relationship R and the entity-class N is 0 (i.e., an instance of the entity-class E may not be associated with any instance of the entity-class N), the Null-Spec property of the multi-valued attribute is 'null'; otherwise, it is 'not-null'." As discussed, an essential requirement of the construct equivalence transformation function is information preserveness. A construct equivalence transformation fimction F is an information preserving fimction if F(F(E)) = E where E is a construct equivalence. Theorem 1: Construct equivalence transformation fimction based on the restructuring operations is an information preserving fiinction. [Proof] An information preserving construct equivalence transformation fimction requires that the effect resulted firom an restructuring operation can be inversed by the same or a different restructuring operation. In other words, every restructuring operation 0 needs to have an inverse restructuring operation R whose inverse restructuring operation is the restructuring operation O. 
Table 5.5 shows the inverse restructuring operation of each restructuring operation employed by the construct equivalence transformation function. 223 Restructuring Operation OPl (ancillary clause -> complex construct-domain) OPl (ancillary clause complex construct-domain) OPS ancillary clause with condition -> ancillary clause with condition ancillary clause with condition -> conditional correspondence OP4 (ancillary clause selection clause) OPS (connection correspondence exchange) OP6 (imconditional property correspondence property selection) OP7 (unconditional property correspondence exchange) OPS (imconditional relationship correspondence path selection) OP9 (unconditional relationship correspondence set-op conversion) OPIO conditional correspondence -> conditional correspondence conditional correspondence —)• ancillary clause with condition OPll (complex construct-domain -> ancillary clause) OP12 (complex construct-domain ancillary clause) OP13 (property selection -> unconditional property correspondence) OP14 (path selection -> unconditional relationship correspondence) OP15 (path selection, extension selection, and set-selection-condition -> ancillary clause) Inverse Restructuring Operation OPll (complex construct-domain -> ancillary clause) OP12 (complex construct-domain -> ancillary clause) OPS (ancillary clause with condition ->• ancillary clause with condition) OPIO (conditional correspondence ancillary clause with condition) OP15 (path selection, extension selection, and set-selection-condition -> ancillary clause) OPS (connection correspondence exchange) OP13 (property selection unconditional property correspondence) OP7 (unconditional property conespondence exchange) OP14 (path selection -> unconditional relationship correspondence) OP9 (unconditional relationship correspondence set-op conversion) OPIO (conditional correspondence conditional correspondence) OPS (ancillary clause with condition -> conditional correspondence) OPl (ancillary clause -> complex construct-domain) OP2 (ancillary clause complex construct-domain) OP6 (unconditional property correspondence -> property selection) OPS (unconditional relationship correspondence path selection) OP4 (ancillary clause —>• selection clause) Table 5.5: Inversibility of Restructuring Operations of the Construct Equivalence Transformation Function 224 Assume a construct equivalence to be transformed be E, the transformed construct equivalence of E be E', and the transformed construct equivalence of E' be E". In the following, the proof of the inversibility between the restructuring operations OPl and OPll is provided. The proof of the inversibility between the rest of restructuring operations can be obtained similarly and thus will not be provided further. Case 1: The inverse restructuring operation of OPl is OPll. For each ancillary clause N of E with the form of "B [ u C ...] n {V} = 0," the restructuring operation OPl appends B [ - C ...]" into the construct-domain of the RHS construct-instance V and removes N from the ancillary-description. Assume the original construct-domain of V be A. A must be a simple construct-domain since the construct-instance V is in the RHS of the construct equivalence to be transformed. After the construct-set exchange as indicated in the step 6 of the transformation function, the construct-instance V becomes a LHS construct-instance with a complex constructdomain "V: A - B [ - C ...]" in E'. When E' is transformed into E", the restructuring operation OPll adds an ancillary clause "B [ u C ...] 
n {V} = 0" into the ancillarydescription and removes B [ - C ...]" from the construct-domain of the LHS construct-instance V (i.e., the construct-domain of V is A now). After the construct-sets of E' are exchanged, the construct-instance V becomes a RHS construct-instance in E". Since V is a RHS construct-instance with a simple construct-domain A in both E and E", the ancillary-clause N is "B [ u C ...] n {V} = 0" in both E and E", and no other component in a construct equivalence is restructured by OPl and OPll, E and E" are 225 identical with respect to all ancillary-clauses with the form of "B [ u C ...] n {V} = 0." Thus, the inverse restructuring operation of OPl is OPll. Case 2: The inverse restructuring operation of OPl 1 is OPl. For each complex construct-domain with the form of "V: A - B [ - C ...]" for a LHS construct-instance V, the restructuring operation OPll adds an ancillary clause "B [ u C .„] n {V} = 0" into the ancillary-description and removes B [ - C ...]" from the construct-domain of the LHS construct-instance V (i.e., V has a simple construct-domain A now). After the construct-sets of E are exchanged, the construct-instance V becomes a RHS construct-instance in E'. When E' is transformed into E", since E' has an ancillary clause with the form of "B [ u C ...] n {V} = 0," the restructuring operation OPl appends B [ - C ...]" into the construct-domain of the RHS construct-instance V (i.e., now the construct-instance V has a complex construct-domain "A - B [ - C ...]") and removes N from the ancillary-description. After the construct-sets of E' are exchanged, the construct-instance V becomes a LHS construct-instance with a complex constructdomain in E". Since V is a LHS construct-instance with a complex construct-domain "V: A - B [ - C ...]" in both E and E", the ancillary-clause "B [ u C ...] n {V} = 0" added by OPll has been removed by OPl, and no other component in a construct equivalence is restructured by OPll and OPl, E and E" are identical with respect to each complex-construct-domain of a LHS construct-instance. restructuring operation of OPll is OPl. Thus, the inverse 226 Thus, combining Case 1 and Case 2, the inversibility of the restructuring OPl and OPl 1 is proved. As mentioned, the inversibility between the rest of restructuring operations can also be proved similarly. Hence, the construct equivalence transformation fiinction based on the restructuring operations OPl to OP15 is an information preserving function. Example 9: (continued from Example 7) The construct equivalence specified as "each non-foreign key attribute of a relation in the relational model is equivalent to a single-valued attribute of the entity-class for the relation in the SOGER model. ..." is transformed into the construct equivalence shown in Figure 5.2. Here, it will be shown that, after applying the construct equivalence transformation function on this transformed construct equivalence, the resulted construct equivalence will be the same as the original construct equivalence. The process of transforming the construct equivalence transformed from Example 5 is summarized in Figure 5.5. E; SOOER.Entity-Class AND A; E.Attribute A/VHERE A.Multiplicity = 'single-valued') R; Relational.Relation AND N: R.Attribute WITH R = E AND N.Name = A.Name AND N.Null-Spec = AlsNull AND N.Uniqueness = A.lsUnique AND N.DataType = ADataType HAVING R.Foreiqn-Kev.Attribute n {N} = 0 <-OP13 (step 4) <-OP5 <r-0?7 <-OP7 <-OP7 <-OP7 <-OPl Figure 5.5: Construct Equivalence Transformed from Example 5. 
(step 2) (step 2) (step 2) (step 2) (step 2) (step 1) 227 The inversibility of the restructuring operations can be observed by comparing Figure 5.1 with Figure 5.5. For example, A.MultipIicity ='single-vaiued' initially was a property correspondence. As shown in Figure 5.1, this property correspondence was transformed by the restructuring operation OP6 and became a property selection clause of the construct-instance A (as shown in Figure 5.2). When transforming the construct equivalence shown in Figure 5.5, the restructuring operation OP13 is applied on the property selection clause A.MultipIicity = 'single-valued' of the LHS construct-instance A and converts the property selection clause into a property correspondence. Thus, the restructuring operation OP13 offsets the effect of OP6 and is therefore the inverse restructuring operation of OP13. Furthermore, the inverse restructuring operation of OPS is itself, as shown in Table 5.5 and illustrated by the restructuring operation OPS on E s R in Figure 5.1 and OPS on R = E in Figure 5.5. Consequently, the construct equivalence resulted from Figure 5.5 is depicted in Figure 5.6. It is the same as its original construct equivalence, as shown in Example 3 and the beginning of Example 7. 228 R: Relational.Relation AND N: R.Attribute - R.Foreign-Key.Attribute E: SOOER.Entity-Class AND A; E.Attribute WITH E = R AND A.Name = N.Name AND A.Null-Spec = N.lsNull AND A.Uniqueness = N.lsUnique AND A.DataType = N.DataType AND A.Multiplicity = 'single-valued' Figure 5.6: Construct Equivalence Transformed from Example 5 5.4 Evaluation of Construct Equivalence Assertion Language Against Design Principles A construct equivalence specified in CEAL only asserts an equivalence (similar to a mathematical equation) between two sets of model constructs. Although construct equivalences are associated with execution semantics, the reasoning method is not embedded in construct equivalences. Thus, it is declarative and satisfies the first design principle (i.e., declarative nature) for a construct equivalence representation defined in Section 5.1. The use of the ancillary-description clause and the information-preserving construct equivalence transformation function ensures that a construct equivalence can be interpreted and reasoned from either direction. Thus, the second design principle (i.e., bi-directional construct equivalences) is observed. A construct-instance (or a constructinstance-set) in the construct set of a construct equivalence is a variable for an instance (or a set of instances) of a model construct and thus can be instantiated by application 229 constructs in an application schema to be translated (as required by the third design principle, instantiable by application constructs). As discussed, a construct equivalence defines an equivalence between two construct-sets (i.e., construct-set = construct-set) each of which consists of a conjunction of constructinstances and/or construct-instance-sets. If both construct-sets in a construct equivalence consist of a single construct-instance or construct-instance-set, the construct equivalence is one-to-one. It is a one-to-many (or many-to-one) construct equivalence if one of its construct-sets consists of a single construct-instance or construct-instance-set and the other has multiple construct-instances and/or construct-instance-sets. If both of the construct-sets include multiple construct-instances and/or construct-instance-sets, it becomes a many-to-many construct equivalence. 
Accordingly, CEAL is capable of expressing one-to-one, one-to-many (or many-to-one), and many-to-many construct equivalences; thus, the fourth design principle is satisfied. Moreover, the constructdomain of a construct instance can be annotated by a model name. All construct- instances and construct-instance-sets belonging to the same construct-set of a construct equivalence must be drawn from the same model schema, while construct-sets in different sides of a construct equivalence can be defined upon different model schemas. If both of the construct-sets of a construct equivalence are from the same model schema, the construct equivalence is an intra-model one; otherwise, it becomes an inter-model construct equivalence. Therefore, a single language for both intra-model and inter-model construct equivalences which is required by the last design principle is supported in CEAL. 230 As for the fifth design principle (i.e., partial model construct in a construct equivalence) to LHS construct-instances, the supporting features in CEAL include complex constructdomains for and instance-selection-conditions on LHS construct-instances. As illustrated previously, a partial model construct "each non foreign-key attribute in the relational model" can be expressed with a complex "Relational.Attribute - Relational.Foreign-Key.Attribute". construct-domain as As such, a complex construct-domain can be used to define a partial model construct in a construct equivalence in terms of set operations on two or more model constructs. In addition, an instance-selection-condition on a LHS construct-instance can be employed to specify a partial model construct in terms of restrictions on its property, path, or extension. On the other hand, property correspondences with values in their RHSs and relationship correspondences whose origin construct-instances are specified in the RHS of construct equivalences determine some characteristics of RHS construct-instances. Thus, even RHS construct-instances cannot be associated to complex construct-domains or instanceselection-conditions, partial model constructs in the RHS of a construct equivalence can still be represented in the construct-correspondence clause. The use of construct-instance-set and the reference protocol to irmer construct-instances of construct-instance-sets render the expressiveness to CEAL for specifying multiple instances of the same model construct in construct equivalences and thus support the sixth design principle. The provision of property and relationship correspondences in the construct-correspondence contributes to CEAL's capability in defining detailed 231 correspondences for construct-equivalences, as required by the seventh design principle. Table 5.6 summarizes the evaluation of CEAL against its design principles. Design Principle 1. Declarative nature 2. Bi-directional construct equivalences 3. Instantiable by application constructs 4. 1-to-l, 1-to-m or m-to-m construct equivalence 5. Partial model construct in a construct equivalence 6. Multiple instances of the same model construct in a construct equivalence 7. Detailed correspondences for construct equivalences 8. 
Single language for both intermodel and intra-model construct equivalence Supporting Features in Construct Equivalence Assertion Language => Expression of construct equivalences similar to mathematics equations => Ancillary-description in the HAVING clause => Information-preserving transformation function => Construct-instance and construct-instance-set => Construct-set as a conjunction of constructinstances and/or construct-instance-set => Complex construct-domain for a LHS construct-instance => Selection-condition on a construct-instance => Property and relationship correspondence for a RHS construct-instance => Construct-instance-set possibly with setselection-condition => Construct-correspondence in the WITH clause - Property correspondence - Relationship correspondence => Model name prefixed in model constructs when constructing construct-domains Table 5.6: Evaluation of Construct Equivalence Assertion Language 232 CHAPTER 6 Construct-Equivalence-Based Schema Translation and Schema Normalization This chapter details algorithms for the construct-equivalence-based schema translation and construct-equivalence-based schema normalization. As shown in Figure 2.9, since both are built upon construct equivalence transformation and reasoning methods, die development of these two methods will be described first. 6.1 Construct Equivalence Transformation Method The construct equivalence transformation method ensures that the construct-sets of each construct equivalence involved in a particular translation or normalization task be in the desired direction. The construct-equivalence-based schema translation involves three transformation tasks: 1) transformation of the inter-model construct equivalences between the source and the target data model, 2) transformation of the intra-model construct equivalences of the source data model, and 3) transformation of the intra-model construct equivalences of the target data model. On the other hand, construct- equivalence-based schema normalization requires only one transformation task, which is to transform, based on pre-determined normalization criteria, intra-model construct equivalences of the data model in which the application schema is specified. Because the transformation task required by the construct-equivalence-based schema normalization is 233 highly similar to the third transformation task of the construct-equivalence-based schema translation, the development of these two algorithms will be discussed together. Algorithm for Transformation of Inter-Model Construct Equivalences: This algorithm ensures that the LHS construct-set of each inter-model construct equivalence be specified on the model constructs of the source data model and that its RHS construct-set be specified on the model constructs of the target data model. Thus, it simply involves the exchange of construct-sets of all inter-model construct equivalences whose current directions are the reverse of the desired direction. The detailed algorithm of this transformation is depicted in Algorithm 6.1. InterCE-Transform(CEsT» SM, TM): /* Input: CEsx (inter-model construct equivalences between the source and the target data model), SM (the name of the soiu-ce data model), and TM (the name of the target data model). 
Result: transformed CEST (each of inter-model construct equivalence whose LHS construct-set is specified on the model constructs of the source data model and RHS construct-set of E is specified on the model constructs of the target data model) */ Begin For each inter-model construct equivalence E in CEgj If the LHS construct-set of E is specified on the model constructs of TM Then transform E according to the construct equivalence transformation function described in Section 5.3. End. Algorithm 6.1: Inter-model Construct Equivalence Transformation 234 Algorithm for Transformation of Intra-Model Construct Equivalences of the Source Data Model: The goal of this transformation is to ensure that all model constructs in the nonoverlapped semantic space of the source data model can be translated into the model constructs in the overlapped semantic space of the source data model. Since the overlapped semantic space of the source data model has, in fact, been represented in the LHS constructs-sets of the inter-model construct equivalences, this transformation is directed by the LHS construct-sets of all inter-model construct equivalences. Accordingly, the transformation process is initiated from identifying intra-model construct equivalences which are directly RHS-associated with some inter-model construct equivalences, followed by identifying the intra-model construct equivalences which are indirectly RHS-associated with the inter-model construct equivalences (i.e., directly RHS-associated with the union of the inter-model construct-equivalences and the intra-model construct equivalences which have been identified as directly or indirectly RHS-associated with the inter-model construct equivalences). A construct equivalence is said to be directly RHS-associated with a set of construct equivalences if the RHS construct-set of the former is the same^ as or a subset'^ of any combination of LHS ' A construct-set CS, is said to be the same as another construct-set CS, if I) for each construct-instance CI in CSj there exists one construct-instance in CS2 whose construct-domain and instance-selection-condition are the same as those of CI, 2) for each construct-instance-set CIS in CS, there exists one constructinstance-set in CS2 whose inner construct-instance and set-selection-condition are the same as those of CIS, 3) for each construct-instance CI in CS, there exists one construct-instance in CS, whose construct-domain and instance-selection-condition are the same as those of CI, and 4) for each construct-instance-set CIS in CST there exists one construct-instance-set in CS, whose inner construct-instance and set-selection condition are the same as those of CIS. A construct-set CS, is said to be a subset of another construct-set CS, if I) for each construct-instance CI in CS, there exists one construct-instance in CS, whose construct-domain and instance-selection-condition are the same as those of CI, and 2) for each construct-instance-set CIS in CS, there exists one constructinstance-set in CS, whose inner construct-instance and set-selection-condition are the same as those of CIS. 235 construct-sets of the latter. Since the constnict-sets between two sides of a construct equivalence are interchangeable, checking whether a construct equivalence is RHSassociated with a set of construct equivalences needs to be performed on both the construct equivalence and its transformed construct equivalence (according to the construct equivalence transformation function described in Section 5.3). 
Furthermore, after all of the intra-model construct equivalences directly and indirectly RHS-associated with the inter-model construct equivalences are identified, the remaining intra-model construct equivalences are removed from the set of intra-model construct equivalences for the intended schema translation because they cannot lead to any model construct in the overlapped semantic space. The detailed algorithm for transforming intra-model construct equivalences of the source data model is shown in Algorithm 6.2. 236 SourceCE-Transfonii(CEs, CEgr): /* Input: CEs (intra-model construct equivalences of the source data model) and CEyr (inter-model construct equivalences). Result; transformed CEs Begin Goal-Set = CESTMark every intra-model construct equivalence in CEj as 'unprocessed'. Repeat For each 'improcessed' intra-model construct equivalence E in CEg > Transform E and assign the resulted construct equivalence into E'. > If E is directly RHS-associated with Goal-Set (i.e., the RHS constructset of E is the same as or a subset of any combination of LHS constructsets in Goal-Set) Then If E' is directly RHS-associated with Goal-Set Then nl = the number of application constructs created - the number of application constructs deleted, if E is executed (based on the execution semantics defined in Section 5.3). n2 = the number of application constructs created - the number of application constructs deleted, if E' is executed. If nl > n2 Then select E Else select E' (see discussion 1) Else select E. Else If E' is directly RHS-associated with Goal-Set Then select E'. > If E is selected Then insert E into Goal-Set and mark E as 'processed'. > IfE' is selected Then insert E' into Goal-Set, replace E by E' and mark E as 'processed'. Until (no intra-model construct equivalence is inserted into the Goal-Set within this for-loop) or (all intra-model construct equivalences in CEs have the status of 'processed'). Remove all of the 'unprocessed' intra-model construct equivalences from CEsEnd. Algorithm 6.2: Intra-model Construct Equivalence Transformation for Source Data Model 237 Discussion 1: When both a construct equivalence E and its transformed construct equivalence E' are directly RHS-associated with a set of construct equivalences S, E (and of course E') is a construct equivalence involving a subset of model constructs of S. In this case, three alternatives can be employed: both E and E' are included, neither of them is included, or either E or E' is included in chains of reasoning. The first alternative will result in a cycle when performing source convergence of the schema translation. In other words, some application constructs (in the source application schema) denoted by the LHS construct-set of E can be translated into application constructs denoted by the RHS of E. By applying E', the application constructs resulting from applying E can be translated back to the application constructs before both E and E' are applied. To avoid increasing the complexity of the source convergence algorithm, this alternative will not be adopted. Although the second alternative does not cause a reasoning cycle during the source convergence stage, dropping both E and E' may degrade the quality of the target application schema since any additional semantics possibly generated by E or E' is not available for the target-enhancement stage. Thus, the second alternative also will not be adopted. 
The third alternative, which does not have the drawback of the second alternative but preserves its advantage, selects either E or E' to be included in chains of reasoning. The decision on which construct equivalence to select can be determined by the number of net application constructs created by them (i.e., the number of application constructs created minus the number of application constructs deleted, if E or E' is executed). A simple heuristics can be developed accordingly: If the number of net 238 application constructs resulting from executing E is greater than or equal to that of E', then E is selected; otherwise, E' is selected. Algorithm for Transformation of Intra-Model Construct Equivalences of the Target Data Model (or the Data Model Used in Schema Normalization): The goal of this transformation in the construct-equivalence-based schema translation is to expand the use of model constructs of the target data model by incorporating the model constructs in the non-overlapped semantic space of the target data model. The unrestricted scenario is to utilize all model constructs available in the target data model for expressing the target application schema. However, sometimes, it may be preferable not to employ some model constructs of the target data model in the target application schema. In such a case, chains of reasoning in the target enhancement stage in the construct-equivalence-based schema translation should not be terminated at these undesired model constructs. Thus, the transformation process of the intra-model construct equivalences of the target data model is driven by the overlapped semantic space of the target data model represented in the EIHS of the inter-model construct equivalences and is constrained by all undesired model constructs of the target data model. The representation of each imdesired model construct is the same as the syntax for construct-instance defined in Section 5.2.2. The set of all undesired model constructs (called aversion set) are connected by conjunctive operators. 239 The general transformation flow for the intra-model construct-equivalences of the target data model is similar to that for the intra-model construct-equivalences of the source data model (as shown in Algorithm 6.2). Since this transformation process is driven by the RHS construct-sets of the inter-model construct equivalences, the identification of intramodel construct equivalences directly and indirectly associated to the inter-model construct equivalences should be defined on the LHS construct-sets of the intra-model construct equivalences of the target data model. A construct equivalence is said to be directly LHS-associated with a set of construct equivalences if the LHS construct-set of the former is the same as or a subset of any combination of RHS construct-sets of the latter. After all of the intra-model construct equivalences directly and indirectly LHSassociated with the inter-model construct equivalences are identified, the remaining intra-model construct equivalences are removed from the set of intra-model construct equivalences for the intended schema translation since they cannot be visited directly or indirectly from the overlapped semantic space. In addition, all terminal intra-model construct equivalences (i.e., the intra-model construct equivalences which are not directly RHS-associated with the rest of intra-model construct equivalences) whose RHS construct-sets include any undesired model construct in the aversion set should be removed as well. 
The detailed algorithm for transforming intra-model construct equivalences of the target data model is depicted in Algorithm 6.3. 240 TargetCE-Transforin(CET, CEgx? A): /* Input: CEf (intra-model construct equivalences of the target data model), CEgr (inter-model construct equivalences), and A (aversion set). Result: transformed CEx */ Begin Origin-Set = CESTMark every intra-model construct equivalence in CEj- as 'unprocessed'. Repeat For each 'unprocessed' intra-model construct equivalence E in CE-r > Transform E and assign the resulted construct equivalence into E'. > If E is directly LHS-associated with Origin-Set (i.e., the LHS constructset of E is the same as or a subset of any combination of RHS constructsets in Origin-Set) Then If E is directly LHS-associated with Origin-Set Then nl = the number of application constructs created - number of application constructs deleted, if E is executed. n2 = the number of application constructs created - number of application constructs deleted, if E' is executed. If nl > n2 Then select E Else select E'. Else select E. Else If E is directly LHS-associated with Origin-Set Then select E'. > If E is selected Then insert E into Origin-Set Else insert E' into Origin-Set and replace E by E'. > Mark E as 'processed'. Until (no new construct equivalence is inserted into Origin-Set within this for-loop) or (all intra-model construct equivalences in CEj have the status of 'processed'). Remove all 'unprocessed' intra-model construct equivalences from CE-r. Repeat For each intra-model construct equivalence E in CE-j> If the construct-domain and the selection-condition of any constructinstance in the RHS construct-set of E are the same as those of any undesired model construct in A Then If E is not directly RHS-associated to the rest of intra-model construct equivalences in CEx Then remove E from CE-rUntil (no more construct equivalence is removed from CEx within this for-loop). End. Algorithm 6.3: Intra-model Construct Equivalence Transformation for Target Data Model 241 The transformation involved in the construct-equivalence-based schema normalization is the same as the restricted case of transforming the intra-model construct equivalences of the target data model for the construct-equivalence-based schema translation as described above, except that the former is not driven by the overlapped semantic space of the target data model but by the aversion set (and also constrained by the aversion set). Thus, with some minor modifications, the transformation algorithm for intra-model construct equivalences of the target data model can be adopted as the algorithm for intra-model construct equivalences for schema normalization. Such modifications include 1) dropping the set of inter-model construct equivalences (i.e., CEst in Algorithm 6.3) from the input of the transformation algorithm and 2) changing Origin-Set = CEST into OriginSet = A (where A is the aversion set). The resulting transformation algorithm required by the construct-equivalence-based schema normalization is depicted in Algorithm 6.4. 242 NormalCE-TransformCCEx? A): /* Input: CEj (intra-model construct equivalences of the data model used in the schema normalization process) and A (aversion set). Result: transformed CEx */ Begin Origin-Set = A. /* different from Algorithm 6.3 */ Mark every intra-model construct equivalence in CEj as 'unprocessed'. Repeat For each 'unprocessed' intra-model construct equivalence E in CEj > Transform E and assign the resulted construct equivalence into E'. 
    > If E is directly LHS-associated with Origin-Set (i.e., the LHS construct-set of E is the same as or a subset of any combination of RHS construct-sets in Origin-Set) Then
        If E' is directly LHS-associated with Origin-Set Then
          n1 = the number of application constructs created - the number of application constructs deleted, if E is executed.
          n2 = the number of application constructs created - the number of application constructs deleted, if E' is executed.
          If n1 >= n2 Then select E Else select E'.
        Else select E.
      Else If E' is directly LHS-associated with Origin-Set Then select E'.
    > If E is selected Then insert E into Origin-Set Else insert E' into Origin-Set and replace E by E'.
    > Mark E as 'processed'.
  Until (no new construct equivalence is inserted into Origin-Set within this for-loop) or (all intra-model construct equivalences in CE_T have the status of 'processed').
  Remove all 'unprocessed' intra-model construct equivalences from CE_T.
  Repeat
    For each intra-model construct equivalence E in CE_T
    > If the construct-domain and the selection-condition of any construct-instance in the RHS construct-set of E are the same as those of any undesired model construct in A Then
        If E is not directly RHS-associated with the rest of the intra-model construct equivalences in CE_T Then remove E from CE_T.
  Until (no more construct equivalence is removed from CE_T within this for-loop).
End.
Algorithm 6.4: Intra-model Construct Equivalence Transformation for Schema Normalization

6.2 Construct Equivalence Reasoning Method

The construct equivalence reasoning method deals with the reasoning of intra-model construct equivalences and that of inter-model construct equivalences. In terms of its application, the construct-equivalence-based schema translation requires both types of reasoning in its three stages (i.e., source convergence, source-target projection, and target enhancement), while the construct-equivalence-based schema normalization involves only the reasoning of intra-model construct equivalences.

Algorithm for Intra-Model Construct Equivalence Reasoning Method: The intra-model construct equivalence reasoning method is employed in the source convergence and target enhancement stages of the schema translation process and by the schema normalization process. The source convergence starts from an initial source application schema and, through a series of intra-model construct equivalence executions, works toward another application schema represented only by the model constructs in the overlapped semantic space of the source data model. The target enhancement goes through the same process; that is, transforming an initial target application schema represented only by the model constructs in the overlapped semantic space of the target data model into another application schema expressed with additional model constructs in the non-overlapped semantic space of the target data model. Moreover, the described process is also applicable to the schema normalization process, in which an initial application schema is transformed through a series of intra-model construct equivalence executions into another application schema containing no undesired model constructs.

Essentially, these three processes can be considered a forward-chaining process [W92, GD93] in a rule-based system, where the application constructs in the source (or target) application schema are facts and the intra-model construct equivalences are knowledge (or rules).
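Viewed this way, all three processes share the same control loop. A minimal Python sketch follows (illustrative only, not the dissertation's implementation; the match/execute interfaces and the resolve_conflict parameter are assumptions); the matching, conflict resolution, and execution steps that this loop repeats are described next.

    def forward_chain(construct_equivalences, application_schema, resolve_conflict):
        """Repeatedly match, resolve conflicts among, and execute construct equivalences."""
        while True:
            # Matching: keep the equivalences whose LHS construct-set can be instantiated
            # from the current application schema (non-empty binding-set).
            conflict_set = [ce for ce in construct_equivalences if ce.match(application_schema)]
            if not conflict_set:
                break
            # Conflict resolution: choose the highest-priority satisfied equivalence.
            chosen = resolve_conflict(conflict_set)
            # Execution: assert the instantiated RHS construct-set into the schema.
            chosen.execute(application_schema)
        return application_schema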
Consistent with a forward-chaining process, the source convergence, target enhancement, or schema normalization (after the intra-model construct equivalences have been transformed appropriately by the construct equivalence transformation method) involves the repetition of the following basic steps:

1. Matching: In this step, the intra-model construct equivalences are instantiated from an application schema to decide which intra-model construct equivalences are satisfied. A construct equivalence is satisfied if its instantiations from the application schema (i.e., the binding-set of the construct equivalence) are not empty (i.e., the binding-set is not an empty set and all columns in the binding-set are not completely empty).

2. Conflict Resolution: It is possible that the matching step will find multiple intra-model construct equivalences that are satisfied. The ones that have the potential to be executed (i.e., their instantiations are not empty) constitute a conflict-set. Conflict resolution involves determining the priority of each intra-model construct equivalence in the conflict-set and then selecting the one with the highest priority to be executed.

3. Execution: The last step is the execution of the selected intra-model construct equivalence based on the execution semantics for intra-model construct equivalences defined in Section 5.2.5.

(A binding-set is an n-dimensional array, where n is the number of construct-instances and construct-instance-sets in the LHS construct-set. Each column denotes a construct-instance or construct-instance-set in the LHS construct-set; the order of columns conforms to that in the LHS construct-set. Each row denotes an instantiation of the LHS construct-set from the application schema to be translated or normalized.)

Determination of the priority of each intra-model construct equivalence in the conflict-set during the conflict resolution step requires a conflict resolution scheme. Commonly used conflict resolution schemes for forward-chaining production rule-based systems include specificity, complexity, and recency of rules [W92, GD93]. Adopting the concepts behind these common conflict resolution schemes, with the consideration that the representation of construct equivalences is distinct from that of rules, the conflict resolution scheme for intra-model construct equivalence reasoning includes the following principles.

1. LHS specificity principle: When the LHS construct-set of an intra-model construct equivalence in the conflict-set is a superset of that of another intra-model construct equivalence in the conflict-set, select the superset intra-model construct equivalence on the grounds that it deals with a more specific situation.

2. LHS complexity principle: The intra-model construct equivalence in the conflict-set with the maximal number of construct-instances and construct-instance-sets in its LHS construct-set will be selected. This principle always tries to attack the most refined portion of the problem.

3. Extent of consequence principle: In the conflict-set, the intra-model construct equivalence with the maximal number of net application constructs being created will be selected. This principle always tries to generate more application constructs, possibly of more different types of model constructs, in an application schema.

4. Applicability principle: The intra-model construct equivalence in the conflict-set with the maximal number of rows in its binding-set (i.e., the maximal number of instantiations from the application schema) will be selected to be executed.
This principle may improve the efficiency of the schema translation or schema normalization because the maximum number of application constructs in the application schema will be translated or normalized in the execution of a single intra-model construct equivalence.

The detailed algorithm for the intra-model construct equivalence reasoning method is shown in Algorithm 6.5, while the procedure for the conflict resolution is depicted in Algorithm 6.6.

IntraCE-Reasoning(CE, AS):
/* Input: CE (intra-model construct equivalences of a data model) and AS (application schema in the data model). Result: updated AS */
Begin
  Repeat
    Initialize Conflict-Set as an empty set.
    /* matching step */
    For each intra-model construct equivalence E in CE
    - Create a binding-set for all possible instantiations of the LHS construct-set of E from AS.
    - If the binding-set is not empty and none of its columns is completely empty Then add E into Conflict-Set.
    If Conflict-Set is not empty, then
      /* conflict resolution step */
      - C = Conflict-Resolution(Conflict-Set, binding-sets).
      /* execution step */
      - For every row which has no empty cell in the binding-set of C
        > Instantiate the RHS construct-set of C according to the execution semantics for intra-model construct equivalences defined in Section 5.2.5.
        > Assert the instantiation of the RHS construct-set in AS.
        > Apply all property and relationship correspondences on AS for this instantiation.
  Until Conflict-Set is empty.
End.
Algorithm 6.5: Intra-model Construct Equivalence Reasoning Method

Conflict-Resolution(CS, BS):
/* Input: CS (conflict-set which contains a set of construct equivalences) and BS (binding-sets, each of which is associated with a construct equivalence in CS). Result: a construct equivalence selected from CS */
Begin
  - If CS has more than one construct equivalence
    Then select the construct equivalence(s) from CS according to the LHS specificity principle and discard the un-selected ones from CS.
  - If CS still has more than one construct equivalence
    Then select the construct equivalence(s) from CS according to the LHS complexity principle and discard the un-selected ones from CS.
  - If CS still has more than one construct equivalence
    Then select the construct equivalence(s) from CS according to the extent of consequence principle and discard the un-selected ones from CS.
  - If CS still has more than one construct equivalence
    Then select the construct equivalence(s) from CS according to the applicability principle and discard the un-selected ones from CS.
  - If CS still has more than one construct equivalence
    Then randomly select a construct equivalence from CS and discard the rest of the construct equivalences from CS.
  - Return the only construct equivalence in CS.
End.
Algorithm 6.6: Conflict Resolution Algorithm
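To make the cascading tie-breaking of Algorithm 6.6 concrete, the following Python sketch applies the four principles as successive filters over the conflict-set (illustrative only; the lhs, constructs_created, constructs_deleted, and binding_set attributes of the candidate equivalences are assumptions):

    import random

    def resolve_conflict(conflict_set):
        """Filter the conflict-set with the four principles in order; break remaining ties randomly."""
        def keep_max(candidates, score):
            best = max(score(c) for c in candidates)
            return [c for c in candidates if score(c) == best]

        candidates = list(conflict_set)
        # 1. LHS specificity: drop an equivalence whose LHS construct-set is a strict subset of another's.
        candidates = [c for c in candidates
                      if not any(set(c.lhs) < set(o.lhs) for o in candidates)]
        # 2. LHS complexity: keep the largest LHS construct-sets.
        candidates = keep_max(candidates, lambda c: len(c.lhs))
        # 3. Extent of consequence: keep the largest number of net application constructs created.
        candidates = keep_max(candidates, lambda c: c.constructs_created - c.constructs_deleted)
        # 4. Applicability: keep the largest number of binding-set rows (instantiations).
        candidates = keep_max(candidates, lambda c: len(c.binding_set))
        return random.choice(candidates)

Because each filter preserves at least one candidate, the final random choice always has a construct equivalence to return.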
Algorithm for Inter-Model Construct Equivalence Reasoning Method: The inter-model construct equivalence reasoning method is required only by the source-target projection stage in the construct-equivalence-based schema translation. It receives an application schema expressed in the source data model and produces, through a series of inter-model construct equivalence executions, an application schema expressed in the target data model.

Unlike the intra-model construct equivalence reasoning method, which updates the initial application schema into the final application schema, the source-target projection stage produces a new application schema expressed in the target data model rather than updating the initial application schema expressed in the source data model. Since the source application schema will not be updated at all during the source-target projection stage, all 'satisfied' inter-model construct equivalences eventually will be executed. To ensure that the order of the inter-model construct equivalences will not affect the source-target projection result, an assumption of no conflicting inter-model construct equivalences needs to be made. That is, any two inter-model construct equivalences with identical LHS construct-sets and overlapped or distinct RHS construct-sets are not allowed. However, it is permissible for one inter-model construct equivalence to have the same LHS construct-set as another inter-model construct equivalence and for the RHS construct-set of the former to be a superset or subset of that of the latter. Under this assumption, the conflict resolution step in the intra-model construct equivalence reasoning method is not needed. Thus, the process for the inter-model construct equivalence reasoning method (i.e., the source-target projection stage) involves the repetition of the matching and the execution steps. As described above, in the matching step, each inter-model construct equivalence is instantiated from the source application schema to decide which inter-model construct equivalences are satisfied. If satisfied, this inter-model construct equivalence will be executed based on the execution semantics for inter-model construct equivalences. The detailed algorithm for the inter-model construct equivalence reasoning method is shown in Algorithm 6.7.

InterCE-Reasoning(CE_ST, AS_S, AS_T):
/* Input: CE_ST (inter-model construct equivalences) and AS_S (application schema in the source data model). Result: AS_T (application schema in the target data model) */
Begin
  Initialize the status of each inter-model construct equivalence in CE_ST as 'valid'.
  Repeat
    For each inter-model construct equivalence E in CE_ST
    - Create a binding-set for all possible instantiations of the LHS construct-set of E from AS_S.
    - If the binding-set is empty or any of its columns (for a construct-instance or construct-instance-set) is completely empty Then mark the status of E as 'invalid' and start the next for-loop.
    - If none of the RHS construct-instances of E is associated with a connection correspondence, or all instantiations (in the binding-set) of every LHS construct-instance involved in a connection correspondence have corresponding application constructs in AS_T, then
      > Mark the status of E as 'fired'.
      > For every row which has no empty cell in the binding-set
        • Instantiate the RHS construct-set and all correspondences (property and relationship) of E according to the execution semantics for inter-model construct equivalences defined in Section 5.3 and store the resulting instantiation in T.
        • If T has not been asserted in AS_T Then assert T in AS_T
          Else If T has been partially asserted in AS_T by another construct equivalence in CE_ST (see Discussion 2 below) Then assert the un-asserted part of T in AS_T.
  Until none of the inter-model construct equivalences in CE_ST has the status of 'valid'.
End.
Algorithm 6.7: Inter-model Construct Equivalence Reasoning Method
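A simplified sketch of this projection loop is given below (Python, illustrative only; the match and instantiate_rhs interfaces are assumptions, and the 'valid'/'invalid'/'fired' bookkeeping and the connection-correspondence waiting condition of Algorithm 6.7 are omitted). Because the source schema is never modified, each satisfied inter-model construct equivalence fires once, and under the no-conflict assumption the order of firing does not affect the result.

    def project(inter_model_equivalences, source_schema):
        """Source-target projection: build a new target schema without updating the source schema."""
        target_schema = set()
        for ce in inter_model_equivalences:            # no conflict resolution step is needed
            binding_set = ce.match(source_schema)      # matching against the source schema only
            for row in binding_set:
                instantiation = ce.instantiate_rhs(row)
                # Assert only the parts of the instantiation not already asserted
                # (the partial-assertion case of Discussion 2).
                target_schema.update(instantiation)
        return target_schema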
Discussion 2: This case deals with the situation in which the RHS construct-set of an inter-model construct equivalence is a subset of that of another inter-model construct equivalence which has already been instantiated and asserted in the target application schema.

6.3 Construct-Equivalence-Based Schema Translation

6.3.1 Algorithm: Construct-Equivalence-Based Schema Translation

With the construct equivalence transformation method and the construct equivalence reasoning method, the development of the construct-equivalence-based schema translation becomes straightforward. Since schema translation is directional (from source to target data model), all construct equivalences need to be transformed, by using the construct equivalence transformation method described in Section 6.1, to establish chains of reasoning from the non-overlapped semantic space of the source data model, via the overlapped semantic space, and finally to the non-overlapped semantic space of the target data model. Subsequently, according to the stage in the schema translation process, either the intra-model or the inter-model construct equivalence reasoning method will be adopted. As a result, the construct-equivalence-based schema translation consists of six steps, as depicted in Algorithm 6.8. The first three steps, involving the construct equivalence transformation method, transform the inter-model construct equivalences between the source and the target data models and the intra-model construct equivalences of each. The remaining three steps, adopting the construct equivalence reasoning method, correspond to the three stages of the construct-equivalence-based schema translation.

/* Input: CE_S (intra-model construct equivalences of the source data model S), CE_T (intra-model construct equivalences of the target data model T), CE_ST (inter-model construct equivalences between S and T), AS_S (application schema in the source data model), and A (aversion set, that is, undesired model constructs of T in the final application schema). Result: AS_T (application schema in the target data model T) */
Begin
  InterCE-Transform(CE_ST, 'S', 'T').   /* Transforming Inter-model CE */
  SourceCE-Transform(CE_S, CE_ST).      /* Transforming Intra-model CE of S */
  TargetCE-Transform(CE_T, CE_ST, A).   /* Transforming Intra-model CE of T */
  IntraCE-Reasoning(CE_S, AS_S).        /* Source Convergence Stage */
  InterCE-Reasoning(CE_ST, AS_S, AS_T). /* Source-Target Projection Stage */
  IntraCE-Reasoning(CE_T, AS_T).        /* Target Enhancement Stage */
End.
Algorithm 6.8: Algorithm for Construct-Equivalence-Based Schema Translation

6.3.2 Advantages of Construct-Equivalence-Based Schema Translation

The intra-model and inter-model construct equivalences in the construct-equivalence-based schema translation are the schema translation knowledge, whose representation is CEAL based on the model schema derived from the inductive metamodeling process. The transformation and reasoning required by the construct-equivalence-based schema translation are separated from the declaration of translation knowledge. Several advantages can be derived from the decomposition of translation knowledge into inter-model and intra-model construct equivalences and the separation of the translation knowledge from its transformation and reasoning methods.

1. The separation allows model engineers to focus only on the specification of the translation knowledge without worrying about how the knowledge will be processed.
Each knowledge element (e.g., an intra-model or inter-model construct equivalence) can be incorporated or removed independently without affecting others. Any incorrect translation can easily be attributed to its causing knowledge element. Since translation knowledge is represented in a declarative format and as independent elements, the specification, debugging and modification of translation knowledge can be performed more easily than for translation knowledge represented in a procedural form.

2. The constituents of translation knowledge make problem decomposition possible. The task of specifying the translation knowledge for two data models is naturally decomposed into five subtasks (i.e., specifying the model schema for each data model, defining the intra-model construct equivalences for each data model, and determining the inter-model construct equivalences between these two data models). Furthermore, with the facilitation of the inductive metamodeling process described in Chapter 4, two of the tasks (i.e., specifying the model schema for each data model) essentially are automated without interaction with model designers. Thus, model engineers need not face a very large complicated task at one time. Rather, they deal with three smaller subtasks one at a time.

3. The characteristic of bi-directional construct equivalences supported in CEAL eliminates the need to specify two different sets of translation knowledge between two data models.

4. The model schema and the intra-model construct equivalences of a data model can be reused in specifying translation knowledge between the data model and different data models. This advantage is important in a large-scale MDBS environment since all local data models need to be translated into a common data model and the common data model needs to be translated into different external data models. Assume that model engineers have defined intra-model construct equivalences for the data model S based on the model schema derived by the inductive metamodeling process. When there is a need to perform the model translation between the data models S and T, what model engineers need to do is to specify intra-model construct equivalences for the data model T based on its model schema and the inter-model construct equivalences between S and T. The existing knowledge pertaining to the data model S can be reused and need not be re-specified. Afterward, if a need arises for the translation between S (or T) and another new data model N, the existing knowledge pertaining to the data model S (or T) can again be reused. As shown in these two examples, when defining the translation between a known data model and a new data model, the reusability of existing knowledge reduces the number of subtasks to be carried out from three to two. If the intra-model construct equivalences of both data models already exist, the translation knowledge to be specified involves only the inter-model construct equivalences (i.e., the reusability of existing knowledge reduces the number of subtasks to be performed from three to one).

5. Automatic derivation of translation knowledge can be achieved in certain cases. Assuming that translation knowledge between the data models S and T and between T and N has been specified, if there is a need to perform translation between S and N, one way is to perform a translation from S to T and then from T to N.
Even though this approach requires no effort from model engineers to define the inter-model construct equivalences between S and N, its performance suffers due to the indirect translation through an intermediate data model (T). Another approach is to deduce the inter-model construct equivalences between S and N from the intra-model construct equivalences of S, T, and N and the inter-model construct equivalences between S and T as well as T and N. Whether all of the inter-model construct equivalences between S and N can be derived deserves further investigation. If not all of them can be deduced, partial results can complement the first approach to improve the performance of the indirect translation.

6. Finally, as explained in Section 2.3, the use of intra-model construct equivalences in the non-overlapped semantic spaces of the source data model and the target data model in the construct-equivalence-based schema translation approach provides a systematic way of minimizing the semantic loss and maximizing the semantic enhancement results.

6.4 Construct-Equivalence-Based Schema Normalization

The algorithm for the construct-equivalence-based schema normalization is much simpler than the algorithm for the construct-equivalence-based schema translation, since the former is a special case of the latter. The construct-equivalence-based schema normalization involves only 1) the transformation of the intra-model construct equivalences of the data model employed in this schema normalization process and 2) the reasoning on the intra-model construct equivalences, which is constrained by the normalization criteria specified for this normalization process. Hence, the construct-equivalence-based schema normalization consists of two steps, as depicted in Algorithm 6.9.

/* Input: CE (intra-model construct equivalences of the data model employed in this normalization process), AS (application schema in the data model), and A (aversion set, that is, normalization criteria). Result: updated AS */
Begin
  NormalCE-Transform(CE, A).   /* Transforming Intra-CE */
  IntraCE-Reasoning(CE, AS).   /* Normalization Process */
End.
Algorithm 6.9: Algorithm for Construct-Equivalence-Based Schema Normalization

CHAPTER 7
Contributions and Future Research

Based on the implications of the semantic characteristics of data models for schema translation and schema integration, a construct-equivalence-based methodology has been developed for the purposes of 1) overcoming the methodological inadequacies of existing schema translation approaches and the conventional schema integration process for large-scale MDBSs, 2) providing an integrated methodology for schema translation and schema normalization, whose similarities of problem formulation have not been previously recognized, and 3) inductively learning model schemas that provide a basis for declaratively specifying construct equivalences for schema translation and schema normalization. In previous chapters, we have detailed the development and some evaluation results of the construct-equivalence-based methodology. In this chapter, the contributions of this dissertation research will be summarized. Possible future research directions will also be discussed.

7.1 Contributions

The contributions of this dissertation research will be discussed from two perspectives: those that are essential to MDBS research and those to other research areas.
7.1.1 Contributions to MDBS Research

The overall contribution of this dissertation research is to provide the missing theoretical foundation and effective techniques for an important problem, schema management, in large information systems, which are rapidly proliferating with emerging IT such as the Intranet/Internet. Specific contributions of this dissertation research are:

1. Formalization of the construct-equivalence-based approach in a metamodeling paradigm for schema translation: The notion of inter-model and intra-model construct equivalences establishes a sound foundation for formalizing the representation and reasoning of translation knowledge for schema translation, as shown in Figure 2.6. To support the specification of inter-model and intra-model construct equivalences, the semantic spaces of data models are formalized as model schemas using a metamodel. Supported by the construct equivalence transformation and reasoning methods, the construct-equivalence-based approach in a metamodeling paradigm provides a formal, declarative, bi-directional, and extensible technique for schema translation. Furthermore, as depicted in Section 6.3.2, the construct-equivalence-based schema translation technique facilitates the specification of translation knowledge by problem decomposition and reusability of existing translation knowledge. The use of intra-model construct equivalences can improve schema translation quality because semantic loss resulting from the non-overlapped semantic space of the source data model can be minimized and semantic enhancement resulting from the non-overlapped semantic space of the target data model can be maximized.

2. A pioneering attempt to facilitate schema integration using intra-model construct equivalences in solving structural conflicts: The conventional schema integration process treats structural conflicts and other types of conflicts (i.e., semantic, descriptive and extension conflicts) uniformly. However, as analyzed in Section 2.2.3, knowledge for identifying and resolving structural conflicts differs from that for identifying and resolving other types of conflicts. The former requires formalizing the semantics of model constructs and equivalences among model constructs of a data model, while the latter is based on understanding the data semantics of LDBSs. Schema normalization based on intra-model construct equivalences (defined on the model schema of a data model) separates the processing of structural conflicts from that of other types of conflicts and facilitates schema integration by reducing, if not totally resolving, structural conflicts before dealing with other types of conflicts.

3. An integrated methodology for schema translation and schema normalization in a large-scale MDBS environment: The notion of construct equivalences reveals similarities in the problem formulation of schema translation and schema normalization. The construct-equivalence-based methodology in a metamodeling paradigm developed in this dissertation research provides an integrated technique for schema translation and schema normalization essential to schema management for large-scale MDBSs. Intra-model construct equivalences defined for schema translation can be utilized by schema normalization. Moreover, the construct equivalence transformation and reasoning methods are applicable to both the schema translation and schema normalization processes.

4.
Formalization of a metamodel for formalizing data models: As discussed in Section 3.2, none of the prevalent alternatives (i.e., EER, OO, and FOL) for the metamodel can fully satisfy the requirements of a metamodel. The SOOER model proposed in this dissertation integrates aspects of these alternatives in meeting the requirements of a metamodel. It consists of a set of semantics-rich structural, behavioral, and declarative constraint constructs to represent three components of a model: model constructs, model constraints and model semantics.

5. A technique for inductive metamodeling: The metamodeling paradigm for construct-equivalence-based schema translation and schema normalization requires metamodeling data models. The Abstraction Induction Technique for inductive metamodeling aims at automating the metamodeling process to make the construct-equivalence-based methodology feasible and promising. Satisfactory evaluation results (see Section 4.4) show the effectiveness of using the Abstraction Induction Technique for inducing model schemas from application schemas without interaction with model designers.

6. Formal language for expressing schema translation and normalization knowledge: Past research failed to propose a formal and declarative language for expressing schema translation and normalization knowledge. This deficiency results in schema translation and schema normalization being procedural. With the associated transformation and reasoning methods, CEAL serves as a formal and declarative language to represent schema translation and normalization knowledge in the form of construct equivalences.

7.1.2 Contributions to Other Research Areas

1. From a global perspective, schema translation and schema normalization for large-scale MDBSs are within the broad research area of method/model integration and interoperability. The proposed research framework and integrated methodology based on the notion of construct equivalence in the metamodeling paradigm can be adopted, or provide insights, for methodology development for managing or integrating IT-related methods/models (e.g., system analysis and design methods, transaction management models, query language translation, etc.).

2. An efficient metamodeling process is an essential component of a meta-CASE tool. The Abstraction Induction Technique for inductive metamodeling can be adopted to provide this missing component of existing meta-CASE tools and could also be applied when integration of CASE tools or meta-CASE tools is inevitable.

7.2 Future Research Directions

This dissertation encompasses a broad research domain that includes a metamodeling paradigm and metamodel development, an inductive metamodeling technique, a formal and declarative construct equivalence assertion language, as well as construct-equivalence-based schema translation and schema normalization. Although initial efforts to formulate a theoretical framework for developing and evaluating the construct-equivalence-based methodology for schema translation and schema normalization have been described in this dissertation, much additional work is needed.

1. Empirical evaluation of the construct-equivalence-based methodology: More empirical evaluation of the Abstraction Induction Technique for inductive metamodeling should be conducted to validate its applicability to inducing model schemas of different data models.
The effectiveness of bi-directional schema translation and schema normalization of the proposed construct-equivalence-based methodology should also be evaluated using real-world applications. Moreover, the expressiveness and user-friendliness of CEAL also need to be evaluated.

2. Development of a construct equivalence verification method: The quality of schema translation or schema normalization depends heavily on the correctness of the inter-model and intra-model construct equivalences provided. Thus, a method for verifying the completeness and consistency of construct equivalences is needed and essential to the construct-equivalence-based methodology.

3. Development of a construct equivalence induction/deduction method: Currently, model engineers play an important role in specifying inter-model and intra-model construct equivalences. Like the knowledge-engineer-driven knowledge engineering process in developing knowledge-based systems, the model-engineer-driven construct equivalence specification process is knowledge-intensive and usually error-prone and could become a bottleneck for schema translation and schema normalization. If the inter-model and intra-model construct equivalences of two data models can be induced from translation or normalization examples from those two data models and/or can be deduced from their model schemas or existing construct equivalences related to intermediate data models, the problems pertaining to the model-engineer-driven construct equivalence specification process can be overcome and the correctness of construct equivalences will be better ensured.

4. System development of the construct-equivalence-based methodology: A partial implementation (i.e., the Abstraction Induction Technique) of the construct-equivalence-based methodology has been prototyped in this dissertation research. More system development efforts involving the implementation of the construct equivalence transformation and reasoning methods and the linking of the current prototype of the Abstraction Induction Technique to existing DBMSs are required.

In addition to efforts for continuing the development and validation of the construct-equivalence-based methodology, the following future research for extending this methodology to other research areas is proposed.

1. Extending the methodology to query language translation and integration of transaction management models: Query language translation and integration of transaction management models in an MDBS are important to its operations. Investigation of the applicability and extension of the research framework and the construct-equivalence-based methodology to these two research issues is suggested.

2. Development of a meta-CASE tool incorporating the inductive metamodeling process: The Abstraction Induction Technique for inductive metamodeling can be adopted for the development of a meta-CASE tool or the functionality enhancement of an existing meta-CASE tool. If the meta-CASE tool (existing or to be developed) into which the inductive metamodeling process will be incorporated is based on the SOOER metamodel, the Abstraction Induction Technique developed in Section 4.2 can be directly adopted without any modification. However, if it is not SOOER-based, some modification to the concept generalization and constraint generation rules would be needed.
265 APPENDIX A Relationships between Synthesized Taxonomy of Conflicts and Other Taxonomies Semantic Conflict Descriptive Conflict Structural Conflict Extension Conflict [RPR94] - Identification of related concepts Naming, key, behavioral, attribute incompatibility, abstraction level of attributes, and scaling conflicts Type conflict Level of accuracy, asynchronous updates, and lack of security [BLN86] Naming conflict - Structural conflict including type conflict [KS94] - Structural conflict including dependency, key, and behavioral conflicts - Abstraction level incompatibility including generalization and aggregation conflicts [SP91], [SPD92] Semantic conflict - Domain definition incompatibility including naming, data representation, scaling, data coding, default value, and integrity constraint conflicts - Entity definition Incompatibility including naming, identifier, schema isomorphism, (i.e., abstraction level of attributes), union compatibility (i.e., attribute incompatibility) conflicts Descriptive conflict including naming, attribute domain, scale, cardinalities, and operations. - Data value incompatibi­ lity including known inconsistency, temporal inconsistency and acceptable inconsistency Structural conflict Schematic Conflict Data valueattribute, attributeentity, and data valueentity conflicts 266 APPENDIX B Model Schema of A Relational Data Model MODEL: Relational { ENTITY-CLASS: Relation { ATTRIBUTE: Name (UNIQUENESS: unique NULL-SPEC: not-nuil MULTIPLICITY: single-valued) IDENTIFIER: Name CONSTRAINT: /* explicit constraints specific to the Relational model */ RC1: Vr: Relation, Vp: r.Primary-key, A = p.compose-of.Attribute => A c r.Attribute RC2: Vr: Relation, Vf: r.Foreign-key, A = f.participate.Attribute => A c r.Attribute /* implicit constraints instantiated from the metamodel semantics */ RC3: Vol: Relation, Vo2: Relation, ol^s: o2 => ol.Name ^ o2.Name RC4: Vo I: Relation => count(o 1 .Name) = 1 RC5: Vo 1: Relation => count(o 1 .aggregate-^Consist-of->Attribute) > 1 RC6: Vo 1: Relation => count(o 1 .aggregate->Consist-of->^Primary-Key) = 1 RC7: Vol: Relation count(ol.aggregate->Consist-of->Foreign-Key) > 0 ENTITY-CLASS: Attribute { ATTRIBUTE: (UNIQUENESS: unique Name NULL-SPEC: not-null MULTIPLICITY: single-valued) (UNIQUENESS: not-unique DataType NULL-SPEC: not-null MULTIPLICITY: single-valued) (UNIQUENESS: not-unique IsUnique NULL-SPEC: not-null MULTIPLICITY: single-valued) (UNIQUENESS: not-unique IsNull NULL-SPEC: not-null MULTIPLICITY: single-valued) DefaultValue (UNIQUENESS: not-unique 267 NULL-SPEC: null MULTIPLICITY: single-valued) IDENTIFIER: Name CONSTRAINT: /* explicit constraints specific to the Relational model */ ACl: Va: Attribute => aJsUnique e {unique, not-unique} AC2: Va: Attribute => aJsNull e {null, not-null} /* implicit constraints instantiated from the metamodel semantics */ ACS: Vol: Attribute, Vo2: Attribute, ol^^oZ => ol.Name vio2.Name AC4: Vo I: Attribute => count(o 1.Name) = 1 ACS: Vo1: Attribute => count(o 1 .DataType) = 1 AC6: Vo 1: Attribute => count(o 1.IsUnique) = 1 AC7: Vol: Attribute =>count(ol.IsNull) = 1 ACS: Vo1: Attribute count(o I .DefauItValue) < I AC9: Vo 1: Attribute => count(o1.Compose-of.Primary-Key) > 0 AC 10: Vo1: Attribute => count(o 1.Participate.Foreign-Key) > 0 ACII: V o1 : A t t r i b u t e = > c o u n t ( o 1. C o m p o s e - o f . 
P r i m a r y - K e y ) < 1 AC 12: Vo1: Attribute => count(o 1.Participate.Foreign-Key) < 1 } ENTITY-CLASS: Primary-Key { CONSTRAINT: /* the implicit constraints instantiated from the metamodel semantics *! PC 1: Vo 1: Primary-Key => count(o 1.Compose-of.Attribute) > 1 PC2: Vo1: Primary-Key => count(o 1.Referenced.Foreign-Key) > 0 ENTITY-CLASS: Foreign-Key { CONSTRAINT: /* implicit constraints instantiated from the metamodel semantics */ FCI: Vo1: Foreign-Key => count(o 1.Participate.Attribute) > I FC2: Vo I: Foreign-Key => count(o1.Referenced.Primary-Key) = 1 } RELATIONSHIP CLASS: Consist-of { TYPE: AGGREGATION AGGREGATE-CLASS: Entity COMPONENT-CLASS: Attribute (MIN-CARDINALITY: 1, MAX-CARDINALITY: m) Primary-Key (MIN-CARDINALITY: I, MAX-CARDINALITY: 1) Foreign-Key (MIN-CARDINALITY: 0, MAX-CARDINALITY: m) RELATIONSHIP CLASS: Compose-of { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: Primary-Key (MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) Attribute (MIN-CARDINALITY: 1, MAX-CARDINALITY: m) } RELATIONSHIP CLASS: Participate { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: Attribute (MIN-CARDINALITY: I, MAX-CARDINALITY: m) Foreign-Key (MIN-CARDINALITY: 0, MAX-CARDINALITY: I) } RELATIONSHIP CLASS: Referenced { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: Primary-Key (MIN-CARDINALITY: UMAX-CARDINALITY: 1) Foreign-Key (MIN-CARDINALITY: 0, MAX-CARDINALITY: m) } 269 APPENDIX C Common Separators and Their Implications to Concept Hierarchy Creation in Abstraction Induction Technique The set of commonly used separators includes (not exhaustively).,;:(){ }->=> = of. Separator > (and) { and } -> => of Implications to Concept Hierarchy Creation A dot at the end of a line indicates the end of an example. If a dot is prefixed and postfixed by terms without spaces in between, it implies that the prefixed term is a qualification term of the postfixed term or the postfixed term is a constituent concept of the prefixed term. In either case, a has-a link is implied from the prefixed term to the postfixed term. A comma separates two terms or two sequences of terms. These two terms (or sequences of terms) usually are in the same level of the concept hierarchy. Same as A colon usually is prefixed and postfixed by terms with or without spaces in between. It may indicate that the postfixed term is a constituent concept of the prefixed term; thus, a has-a link is implied from the prefixed term to the postfixed term. Any sequence of terms enclosed within an opening bracket and its corresponding ending bracket usually is the constituent concept of the term before the opening bracket. Hence, a set of has-a links are implied from the term before the opening bracket to each of terms enclosed within brackets. If the prior character of'(' is not a space or a separator, the pair of'(' and ')' are treated as part of a term. Same as '(' and ')'. If there is no space inmiediately before and after it may indicate the prefixed term is a qualification term of the postfixed term or the postfixed term is a constituent concept of the prefixed term. Thus, a hasa liiik is implied from the prefixed term to the postfixed term. Same as An equality sign indicates that the postfixed term or sequence of terms are the constituent concept of the prefixed term. A has-a link is implied from the prefixed term to the postfixed tenn(s). The word 'of indicates that the prefixed term is the constituent concept of the postfixed term, or that the postfixed term is the qualification term of the prefixed term. 
In either case, a has-a link is implied from the postfixed term to the prefixed one. 270 APPENDIX D Evaluation Study 1: Relational Model Schema Induced from A University Health Center Database Property Hierarchies: Data-type = {int, integer, char(n), text(n), double, integer, character(n), numeric} char(n) = {char(lO), char(20)} Null-spec = {null, not-null} Stopword and Keywords: ## STOPWORD CREATE DEFINE REFERENCE ## KEYWORD RELATION TABLE ATTRIBUTE Training Examples: ## TABLE DISPENSE (TREATMENT_NO DOUBLE NOT-NULL, MED_CODE DOUBLE NOT-NULL, MED_QTY INTEGER NOT-NULL, PRIMARY-KEY TREATMENT_NO MED_CODE) ## TABLE DOCTOR (DOCTOR_ID DOUBLE NOT-NULL, NAME TEXT(30) NOT-NULL, PRIMARY-KEY DOCTOR_ID) ## TABLE DRUGS (CODE DOUBLE NOT-NULL, MED_NAME TEXT(30) NOT-NULL, MED_DESC MEMO NOT-NULL, 271 USE_METHOD MEMO NOT-NDLL, DNIT TEXT(5) NOT-NULL, PRIMARY-KEY CODE) ## TABLE STUDENT (STUDENT_ID INTEGER NOT-NULL, NAME TEXT(30) NOT-NULL, ADDRESS TEXT(100) NULL, DEPT TEXT(4) NULL, AGE INTEGER NOT-NULL, GENDER TEXT(l) NOT-NULL, TEL TEXT(15) NULL, STATUS TEXTd) NOT-NULL, CREDIT INTEGER NULL, REMARKS MEMO NULL, DATE_BIRTH DATE NOT-NULL, DATE_REGISTER DATE NOT-NULL, REGISTER_STATUS YES/NO NOT-NULL, REGISTER_TIME TIME NULL, TREATMENT_ROOM INTEGER NOT-NULL, PRIMARY-KEY STUDENT_ID) ## TABLE TREATMENT (TREATMENT_NO COUNTER NOT-NULL, D_ID DOUBLE NOT-NULL, S_ID INTEGER NOT-NULL, TREATMENT_DATE DATE NOT-NULL, TREATMENT_TIME TIME NULL, DIAGNOSIS MEMO NULL, DISPENSE_STATUS YES/NO NOT-NULL, PRIMARY-KEY TREATMENT_NO) ## TABLE SPECIALTIES (DOCTOR_ID DOUBLE NOT-NULL, SPECIALTY TEXT(20) NOT-NULL, PRIMARY-KEY DOCTOR_ID SPECIALTY) ## FOREIGN-KEY (SPECIALTIES.DOCTOR_ID) REFERENCE (DOCTOR.DOCTOR_ID) ## FOREIGN-KEY (DISPENSE.TREATMENT_NO) REFERENCE (TREATMENT.TREATMENT_NO) ## FOREIGN-KEY (DISPENSE.MED_CODE) REFERENCE (DRUGS.CODE) ## 272 FOREIGN-KEY (TREATMENT.D_ID) REFERENCE (DOCTOR.DOCTOR_ID) ## FOREIGN-KEY (TREATMENT.S ID) REFERENCE (STDDENT.STUDENT ID) Induced Model Schema: ENTITY-CLASS: PRIMARY-KEY { } ENTITY-CLASS: FOREIGN-KEY { } ENTITY-CLASS: C0NSTRUCT_1 { ATTRIBUTES: name: TYPE: char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued DATA-TYPE: TYPE: char(15) UNIQUENESS: not-unique NULL-SPEC: not-null MULTIPLICITY: single-valued NULL-SPEC: TYPE: char(15) UNIQUENESS: not-xinique NULL-SPEC: not-null MULTIPLICITY: single-valued CONSTRAINT: Constraint_l: forall ol:C0NSTRUCT_1, forall o2:C0NSTRUCT_1, ol 1= o2 => ol.name 1= o2.nanie Constraint_2: forall ol:C0NSTRUCT_1 => count(ol.name) = 1 Constraint_3: forall a:C0NSTRUCT_1 => a.DATA-TYPE in {iNT, CHAR(N), TEXT(N), DOUBLE, INTEGER, CHARACTER(N), NUMERIC, DATE, MEMO, MEMO, COUNTER, MEMO, YES/NO, TIME, COUNTER, TIME, MEMO, YES/NO} Constraint_4: forall ol:C0NSTRUCT_1 => coxmt(ol.DATA-TYPE) = 1 Constraint_5: forall a:C0NSTRUCT_1 => a.NULL-SPEC in {NULL, NOT-NULL} Constraint_6: forall ol:C0NSTRUCT_1 => coxint(ol.NULL-SPEC) = 1 } 273 ENTITY-CLASS: TABLE { ATTRIBUTES: name: TYPE: Char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued CONSTRAINT: Constraint_7: forall ol:TABLE, forall o2:TABLE, ol != o2 => ol.name != o2.name Constraint_8: forall ol:TABLE => covmt(ol.name) = 1 Constraint_14: forall c:TABLE, forall e:c.PRIMARY-KEY, A=e.RELATE_1.C0NSTRUCT_1 => A included-in c.C0NSTRUCT_1 Constraint_18: forall c:TABLE, forall e:c.FOREIGN-KEY, A=e.RELATE_2.C0NSTRUCT_l => A included-in c.C0NSTRUCT_1 } RELATIONSHIP-CLASS: C0NSIST_0F_1 { TYPE: AGGREGATION AGGREGATE: TABLE COMPONENT: C0NSTRUCT_1 (MIN-CARDINALITY: 2, MAX-CARDINALITY: PRIMARY-KEY 
(MIN-CARDINALITY: 1, MAX-CARDINALITY: FOREIGN-KEY (MIN-CARDINALITY: 0, MAX-CARDINALITY: CONSTRAINT: Constraint_9: forall ol:TABLE => count(ol.aggregate->CONSIST_OF_l->component.CONSTRUCT_l) Constraint_10: forall ol:TABLE => count(ol.aggregate->CONSIST_OF_l->component.PRIMARY-KEY) Constraint_ll: forall ol:TABLE => count(ol.aggregate->C0NSIST_0F_1->component.FOREIGN-KEY) M) 1) M) >= 2 = 1 >= 0 RELATIONSHIP-CLASS: RELATE_1 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: PRIMARY-KEY (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) C0NSTRUCT_1 (ROLE: REFERED_BY, MIN-CARDINALITY: 1, MAX-CARDINALITY: M) CONSTRAINT: Constraint_12: forall ol:PRIMARY-KEY => count(ol.REFER_TO->RELATE_l->REFERED_BY.CONSTRUCT_l) >= 1 Constraint_13: forall ol:C0NSTRUCT_1 => count(ol.REFERED_BY->RELATE_l->REFER_TO.PRIMARY-KEY) >= 0 } 274 RELATIONSHIP-CLASS: REIjATE_2 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: FOREIGN-KEY (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) C0NSTRUCT_1 (ROLE: REFERED_BY, MIN-CARDINALITY: 1, MAX-CARDINALITY: 1) CONSTRAINT: Constraint_15: forall ol:FOREIGN-KEY => coimt(ol.REFER_TO->RELATE_2->REFERED_BY.C0NSTRUCT_1) = 1 Constraint_16: forall ol:C0NSTRUCT_1 => covmt(ol.REFERED_BY->RELATE_2->REFER_T0.FOREIGN-KEY) >= 0 Constraint_17: forall ol:C0NSTRUCT_1 => coimt(ol.REFERED_BY->RELATE_2->REFER_T0.FOREIGN-KEY) <= 1 } RELATIONSHIP-CLASS: RELATE_3 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: FOREIGN-KEY (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: M) PRIMARY-KEY (ROLE: REFERED_BY, MIN-CARDINALITY: 1, MAX-CARDINALITY: 1) CONSTRAINT: Constraint_19: forall ol:FOREIGN-KEY => count(ol.REFER_T0->RELATE_3->REFERED_BY.PRIMARY-KEY) = 1 Constraint_20: forall ol:PRIMARY-KEY => count(ol.REFERED_BY->RELATE_3->REFER_TO.FOREIGN-KEY) >= 0 Constraint_21: forall ol:PRIMARY-KEY => covmt(ol.REFERED_BY->RELATE_3->REFER_T0.FOREIGN-KEY) <= M } 275 APPENDIX E Evaluation Study 2: Network Model Schema Induced from A Hypothetical Company Database Property Hierarchies: Data-type = {int, integer, char(n), text(n), double, integer, character(n), numeric} char(n) = {char(lO), char(20)} Null-spec = {null, not-null} Stopword and Keywords: ## STOPWORD TYPE IS ASCENDING DECENDING ## KEYWORD RECORD SET INSERTION RETENTION OWNER MEMBER Training Examples: ## RECORD IS EMPLOYEE {NOT-DUPLICATE SSN, NOT-DUPLICATE FNAME MINIT LNAME, FNAME TYPE IS CHARACTER(15), MINIT TYPE IS CHARACTER(1) , LNAME TYPE IS CHARACTER(15) , SSN TYPE IS CHARACTER(9), DEPTNAME TYPE IS CHARACTER(IS)} ## RECORD IS DEPARTMENT {NOT-DUPLICATE NAME, NAME TYPE IS CHARACTER(15), NUMBER TYPE IS INTEGER, LOCATION TYPE IS CHARACTER(15), MGRSTART TYPE IS CHARACTER(15)} ## RECORD IS PROJECT {NOT-DUPLICATE NAME, NOT-DUPLICATE NUMBER, NAME TYPE IS CHARACTER(15), NUMBER TYPE IS INTEGER, LOCATION TYPE IS CHARACTER(15)} ## RECORD IS WORKS_ON {NOT-DUPLICATE ESSN PNUMBER, ESSN TYPE IS CHARACTER(9), PNUMBER TYPE IS INTEGER, HOURS TYPE IS NUMERIC} ## SET IS WORKS_FOR {OWNER IS DEPARTMENT, MEflBER IS EMPLOYEE, INSERTION IS MANUAL, RETENTION IS OPTIONAL} ## SET IS MANAGES {OWNER IS EMPLOYEE, MEMBER IS DEPARTMENT, INSERTION IS AUTOMATIC, RETENTION IS MANDATORY} ## SET IS CONTROLS {OWNER IS DEPARTMENT, MEMBER IS PROJECT, INSERTION IS AUTOMATIC, RETENTION IS MANDATORY} ## SET IS P_WORKSON {OWNER IS PROJECT, MEMBER IS WORKS_ON, INSERTION IS MANUAL, RETENTION IS FIXED} Induced Model Schema: ENTITY-CLASS: NOT-DUPLICATE { } ENTITY-CLASS: C0NSTRUCT_1 { ATTRIBUTES: name: TYPE: char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued DATA-TYPE: TYPE: 
char(15) UNIQUENESS: not-unique NULL-SPEC: not-null MULTIPLICITY: single-valued CONSTRAINT: Constraint_l: forall ol:C0NSTRUCT_1, forall o2:C0NSTRUCT_1, ol != o2 => ol.name != o2.name Constraint_2: forall ol:C0NSTRUCT_1 => count(ol.name) = 1 Constraint_3: forall a:C0NSTRUCT_1 => a.DATA-TYPE in {iNT, CHAR(N), INTEGER, CHARACTER(N), NUMERIC, DATE} Constraint_4: forall ol:C0NSTRUCT_1 => count(ol.DATA-TYPE) } ENTITY-CLASS: RECORD { ATTRIBUTES: name: TYPE: char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued CONSTRAINT: Constraint_5: forall ol:RECORD, forall o2:REC0RD, ol != o2 => ol.name != o2.name Constraint_S: forall ol:REC0RD => count(ol.name) = 1 Constraint_17: forall c:REC0RD, forall e:c.NOT-DUPLICATE, A=e.RELATE_1.CONSTRUCT 1 => A included-in c.CONSTRUCT } RELATIONSHIP-CLASS: C0NSIST_0F_1 { TYPE: AGGREGATION AGGREGATE: RECORD COMPONENT: NOT-DUPLICATE (MIN-CARDINALITY: 1, MAX-CARDINALITY 278 C0NSTRUCT_1 (MIN-CARDINALITY: 3, MAX-CARDINALITY: M) CONSTRAINT: Constraint_7: forall ol:RECORD => coiint (ol.aggregate->CONSIST_OF_l->coniponent.NOT-DUPLICATE) >= 1 Constraint_8: forall ol:RECORD => count(ol.aggregate->CONSIST_OF_l->component.CONSTRUCT_l) >= 3 } ENTITY-CLASS: SET { ATTRIBUTES: name: TYPE: char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued INSERTION: TYPE: char(15) UNIQUENESS: not-unique NULL-SPEC: not-null MULTIPLICITY: single-valued RETENTION: TYPE: char(15) UNIQUENESS: not-unique NULL-SPEC: not-null MULTIPLICITY: single-valued CONSTRAINT: Constraint_9: forall ol:SET, forall o2:SET, ol != o2 => ol.name != o2.name Constraint_10: forall ol:SET => coiint (ol.name) = 1 Constraint_ll: forall a:SET => a.INSERTION in {MANUAL, AUTOMATIC} Constraint_12: forall ol:SET => coxmt(ol.INSERTION) = 1 Constraint_13: forall a:SET => a.RETENTION in {OPTIONAL, MANDATORY, FIXED} Constraint_14: forall ol:SET => count(ol.RETENTION) = 1 } RELATIONSHIP-CLASS: RELATE_1 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: NOT-DUPLICATE (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) C0NSTRUCT_1 (ROLE: REFERED_BY, MIN-CARDINALITY: 1, MAX-CARDINALITY: M) CONSTRAINT: Constraint_15: forall ol:NOT-DUPLICATE => count(ol.REFER_TO->RELATE_1->REFERED_BY.C0NSTRUCT_1) >= 1 Constraint_16: forall ol:CONSTRUCT_l => count(ol.REFERED BY->RELATE 1->REFER TO.NOT-DUPLICATE) >= 0 279 } RELATIONSHIP-CLASS: RELATE_2 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: SET (ROLE: REFER_TO, MIN-CARDINALITY: 1, MAX-CARDINALITY: 1) RECORD (ROLE: REFERED_BY, MIN-CARDINALITY: 2, MAXCARDINALITY: 2) CONSTRAINT: Constraint_18: forall ol:SET => count (ol.REFER_T0->RELATE_2->REFERED_BY.RECORD) = 2 Constraint_19: forall ol:RECORD => count(ol.REFERED_BY->RELATE_2->REFER_T0.SET) >= 1 } 280 APPENDIX F Evaluation Study 3: Hierarchical Model Schema Induced from A Hypothetical Company Database Property Hierarchies: Data-type = {int, integer, char(n) , text{n.), doiible, integer, character(n), numeric} char(n) = {chardO), char(20)} Null-spec = {null, not-null} Stopword and Keywords: ## STOPWORD IS OF ASCENDING DECENDING ASC DESC ## KEYWORD RECORD ROOT CHILD-NUMBER POINTER Training Examples: ## RECORD EMPLOYEE {ROOT OF HIERARCHIES.HIERARCHY2, FNAME CHARACTER(15), MINIT CHARACTER(1), LNAME CHARACTER {15), SSN CHARACTER(9), BDATE CHARACTER(9), KEY SSN} ## RECORD DEPARTMENT {ROOT OF HIERARCHIES.HIERARCHYl, DNAME CHARACTER{15), DNUMBER INTEGER, KEY DNAME, KEY DNDMBER} ## RECORD DMANAGER {PARENT DEPARTMENT, CHILD-NUMBER 1, MGRSTARTDATE CHARACTER(9), POINTER MPTR EMPLOYEE} ## RECORD PROJECT {PARENT DEPARTMENT, 
CHILD-NUMBER 4, PNAME CHARACTER(15), PNUMBER INTEGER, PLOCATION CHARACTER(15), KEY PNAME, KEY PNUMBER} ## RECORD PWORKER {PARENT PROJECT, CHILD-NUMBER 1, HOURS CHARACTER(4), POINTER WPTR EMPLOYEE} ## RECORD DEMPLOYEES {PARENT DEPARTMENT, CHILD-NUMBER 2, POINTER EPTR EMPLOYEE} ## RECORD ESUPERVISEES {PARENT EMPLOYEE, CHILD-NUMBER 2, POINTER SPTR EMPLOYEE} ## RECORD DEPENDENT {PARENT EMPLOYEE, CHILD-NUMBER 1, DEPNAME CHARACTER(15), SEX CHARACTER(1), BIRTHDATE CHARACTER(9) , RELATIONSHIP CHARACTER(10) } ## HIERARCHIES {HIERARCHYl, HIERARCHYa} Induced Model Schema: ENTITY-CLASS: KEY ENTITY-CLASS: PARENT ENTITY-CLASS: POINTER ENTITY-CLASS: C0NSTRUCT_1 { ATTRIBUTES: name: TYPE: Char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued DATA-TYPE: TYPE: Char(15) UNIQUENESS: not-unique NULL-SPEC: not-null MULTIPLICITY: single-valued CONSTRAINT: Constraint_l: forall ol:C0NSTRUCT_1, forall o2:C0NSTRUCT_1 ol != o2 => ol.name != o2.name Constraint_2: forall ol:C0NSTRUCT_1 => count(ol.name) = 1 Constraint_3: forall a:C0NSTRUCT_1 => a.DATA-TYPE in {iNT, CHAR(N), INTEGER, CHARACTER(N), NUMERIC, DATE} Constraint_4: forall ol:C0NSTRUCT_1 => count(ol.DATA-TYPE) } ENTITY-CLASS: RECORD { ATTRIBUTES: name: TYPE: char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued 283 CHILD-NDMBER: TYPE: char(15) UNIQUENESS: not-unique NULL-SPEC: null MULTIPLICITY: single-valued CONSTRAINT: Constraint_5: forall ol:RECORD, forall o2:RECORD, ol != o2 => ol.name != o2.name Constraint_6: forall ol:RECORD => count(ol.name) = 1 Constraint_7: forall a:RECORD => a.CHILD-NUMBER in {l, 4, 2} Constraint_8: forall ol:RECORD => count(ol.CHILD-NUMBER) <= 1 Constraint_24: forall c:RECORD, forall e:c.KEY, A=e.RELATE_2.C0NSTRUCT_1 => A included-in c.C0NSTRUCT_1 } RELATIONSHIP-CLASS: C0NSIST_0F_1 { TYPE: AGGREGATION AGGREGATE: RECORD COMPONENT: C0NSTRUCT_1 (MIN-CARDINALITY: 0, MAX-CARDINALITY: M) KEY (MIN-CARDINALITY: 0, MAX-CARDINALITY: M) PARENT (MIN-CARDINALITY: 0, MAX - CARDINALITY: 1) POINTER (MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) CONSTRAINT: Constraint_9: forall ol:RECORD => count (ol.aggregate->CONSIST_OF_l->component.CONSTRUCT_l) >= 0 Constraint_10: forall ol:RECORD => count (ol.aggregate->C0NSIST_0F_1->component.KEY) >= 0 Constraint_ll: forall ol:RECORD => count (ol.aggregate->CONSIST_OF_l->component.PARENT) >= 0 Constraint_12: forall ol:RECORD => count(ol.aggregate->CONSIST_OF_l->component.PARENT) <= 1 Constraint_13: forall ol:RECORD => count(ol.aggregate->CONSIST_OF_l->component.POINTER) >= 0 Constraint_14: forall ol:RECORD => coiant (ol.aggregate->CONSIST_OF_l->component.POINTER) <= 1 ENTITY-CLASS: HIERARCHIES { ATTRIBUTES: name: TYPE: char(15) UNIQUENESS: unique NULL-SPEC: not-null MULTIPLICITY: single-valued CONSTRAINT: Constraint_15: forall ol:HIERARCHIES, forall o2:HIERARCHIES, ol != o2 => ol.name != o2.name 284 Constraint_lS: forall ol:HIERARCHIES => count(ol.name) = 1 } RELATIONSHIP-CLASS: RELATE_1 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: RECORD (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) HIERARCHIES (ROLE: REFERED_BY, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) CONSTRAINT: Constraint_17: forall ol:RECORD => count(ol.REFER_TO->RELATE_l->REFERED_BY.HIERARCHIES) >= 0 Constraint_18: forall ol:RECORD => count(ol.REFER_TO->RELATE_1->REFERED_BY.HIERARCHIES) <= 1 Constraint_l9: forall ol:HIERARCHIES => count(ol.REFERED_BY->RELATE_l->REFER_TO.RECORD) >= 0 Constraint_20: forall ol:HIERARCHIES => count(ol.REFERED_BY->RELATE_l->REFER_TO.RECORD) <= 1 } RELATIONSHIP-CLASS: RELATE_2 { 
TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: KEY (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) C0NSTRUCT_1 (ROLE: REFERED_BY, MIN-CARDINALITY: 1, MAX-CARDINALITY: 1) CONSTRAINT: Constraint_21: forall ol:KEY => count(ol.REFER_TO->RELATE_2->REFERED_BY.C0NSTRUCT_1) = 1 Constraint_22: forall ol:C0NSTRUCT_1 => count(ol.REFERED_BY- >REIiATE_2- >REFER_TO.KEY) >= 0 Constraint_23: forall ol:C0NSTRUCT_1 => count(ol.REFERED_BY->RELATE_2->REFER_T0.KEY) <= 1 } RELATIONSHIP-CLASS: RELATE_3 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: PARENT (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) RECORD (ROLE: REFERED_BY, MIN-CARDINALITY: 1, MAX-CARDINALITY: 1) CONSTRAINT: Constraint_25: forall ol:PARENT => count(ol.REFER TO->RELATE 3->REFERED BY.RECORD) = 1 Constraint_2S: forall ol:RECORD => count(ol.REFERED_By- >RELATE_3 - >REFER_TO.PARENT) Constraint_27: forall ol:RECORD => count (ol.REFERED_BY- >RELATE_3 - >REFER_TO.PARENT) } RELATIONSHIP-CLASS: RELATE_4 { TYPE: ASSOCIATION ASSOCIATED-ENTITY-CLASS: POINTER (ROLE: REFER_TO, MIN-CARDINALITY: 0, MAX-CARDINALITY: 1) RECORD (ROLE: REFERED_BY, MIN-CARDINALITY: 1, MAX-CARDINALITY: 1) CONSTRAINT: Constraint_28: forall ol:POINTER => count(ol.REFER_TO- >RELATE_4 - >REFERED_BY.RECORD) Constraint_29: forall ol:RECORD => count (ol.REFERED_BY- >RELATE_4- >REFER_TO.POINTER) Constraint_30: forall ol:RECORD => count (ol.REFERED_BY->RELATE_4 - >REFER_TO.POINTER) } 286 APPENDIX G Inter-Model Construct Equivalences between the EER and SOOER Models 1. Each relation whose primary key attributes do not contain any of its foreign key attributes in the relational model is equivalent to an entity-class in the SOOER model. The name of the entity-class corresponds to that of the relation. R: Relational.Relation (WHERE R.Primary-Key.Attribute.Foreign-Key = 0) E: SOOER.Entity-Class WITH E.Name = R.Name 2. In the relational model, each relation 1) whose primary key contains a foreign key to a single relation and non-foreign key attributes, and 2) whose attributes are its primary key is equivalent to a new entity-class and a new association relationship which relates the new entity-class and the entity-class for the referenced relation in the SOOER model. The name of the entity-class corresponds to that of the relation. The attribute(s) of the new entity-class is the non-foreign key attribute(s) of the relation and becomes the primary key attribute(s) of the new entity-class. R l : Relational.Relation (WHERE Rl.Primary-Key.Attribute - Rl .Foreign-Key.Attribute ^ 0 AND Rl.Attribute = Rl.Primary-Key.Attribute) AND R2: Relational.Relation (WHERE R2.Primary-Key.Attribute = Rl.Primary-Key.Attribute.Foreign-Key.Attribute) E l : SOOER.Entity-Class AND E2: SOOER.Entity-Class AND A: SOOER.Association WITH E2 = R2 AND A.relate.Entity-Class = {El, E2} AND El.Name = Rl.Name AND El.Attribute = Rl.Attribute - R2.Primary-Key.Attribute AND El.Primary-Key.Attribute = El.Attribute AND A.relate.El (Max-Card) = m AND A.relate.El(Min-Card) = 1 IF ext(Rl.Primary-Key.Attribute n Rl .Foreign-Key.Attribute) = ext(R2.Primary-key.Attribute) AND A.relate.El (Min-Card) = 0 287 IF ext(Rl.Primary-Key.Attribute n Rl.Foreign-Key.Attribute) * ext(R2.Primary-key.Attribute) AND A.reiate.E2(Min-Card) = 1 AND A.reIate.E2(Max-Card) = m IF has_duplicate(ext(Rl.Primary-Key.Attribute Rl.Foreign-Key.Attribute)) = true AND A.relate.E2(Max-Card) = 1 IF has_duplicate(ext(Rl.Primary-Key.Attribute Rl.Foreign-Key.Attribute)) = false 3. 
APPENDIX G
Inter-Model Construct Equivalences between the EER and SOOER Models

1. Each relation whose primary key attributes do not contain any of its foreign key attributes in the relational model is equivalent to an entity-class in the SOOER model. The name of the entity-class corresponds to that of the relation.

R: Relational.Relation (WHERE R.Primary-Key.Attribute.Foreign-Key = ∅)
E: SOOER.Entity-Class WITH E.Name = R.Name

2. In the relational model, each relation 1) whose primary key contains a foreign key to a single relation as well as non-foreign-key attributes, and 2) whose attributes are exactly its primary key attributes, is equivalent to a new entity-class and a new association relationship which relates the new entity-class and the entity-class for the referenced relation in the SOOER model. The name of the new entity-class corresponds to that of the relation. The attributes of the new entity-class are the non-foreign-key attributes of the relation and become the primary key attributes of the new entity-class.

R1: Relational.Relation (WHERE R1.Primary-Key.Attribute - R1.Foreign-Key.Attribute ≠ ∅ AND R1.Attribute = R1.Primary-Key.Attribute) AND
R2: Relational.Relation (WHERE R2.Primary-Key.Attribute = R1.Primary-Key.Attribute.Foreign-Key.Attribute)
E1: SOOER.Entity-Class AND E2: SOOER.Entity-Class AND A: SOOER.Association WITH
E2 = R2 AND A.relate.Entity-Class = {E1, E2} AND
E1.Name = R1.Name AND E1.Attribute = R1.Attribute - R2.Primary-Key.Attribute AND
E1.Primary-Key.Attribute = E1.Attribute AND
A.relate.E1(Max-Card) = m AND
A.relate.E1(Min-Card) = 1 IF ext(R1.Primary-Key.Attribute ∩ R1.Foreign-Key.Attribute) = ext(R2.Primary-Key.Attribute) AND
A.relate.E1(Min-Card) = 0 IF ext(R1.Primary-Key.Attribute ∩ R1.Foreign-Key.Attribute) ≠ ext(R2.Primary-Key.Attribute) AND
A.relate.E2(Min-Card) = 1 AND
A.relate.E2(Max-Card) = m IF has_duplicate(ext(R1.Primary-Key.Attribute ∩ R1.Foreign-Key.Attribute)) = true AND
A.relate.E2(Max-Card) = 1 IF has_duplicate(ext(R1.Primary-Key.Attribute ∩ R1.Foreign-Key.Attribute)) = false

3. Each relation R whose primary key is a foreign key to a single relation in the relational model is equivalent to a new entity-class and a new specialization relationship in the SOOER model. The entity-class for the relation referenced by the foreign key is the superclass, and the new entity-class is the subclass of the specialization relationship. The name of the new entity-class corresponds to that of R.

R1: Relational.Relation AND
R2: Relational.Relation (WHERE R2.Primary-Key = R1.Foreign-Key.Primary-Key AND R2.Primary-Key.Attribute = R1.Primary-Key.Attribute)
E1: SOOER.Entity-Class AND E2: SOOER.Entity-Class AND S: SOOER.Specialization WITH
E2 = R2 AND S.superclass.Entity-Class = {E2} AND S.subclass.Entity-Class = {E1} AND
E1.Name = R1.Name AND
R1.Foreign-Key.Attribute = R1.Foreign-Key.Attribute - R1.Primary-Key.Attribute AND
R1.Attribute = R1.Attribute - R2.Attribute

4. In the relational model, each relation R whose primary key attributes are foreign key attributes referencing more than one relation is equivalent to a new association relationship which connects the entity-classes for the relations referenced by those foreign keys in the SOOER model. The name of the association relationship is the same as that of R.

R: Relational.Relation AND
S: {T: R.Primary-Key.Attribute.Foreign-Key.Primary-Key.Relation} (WHERE count(S) > 1)
D: {E: SOOER.Entity-Class} AND A: SOOER.Association WITH
E = T AND A.relate.Entity-Class = D AND A.Name = R.Name AND
R.Attribute = R.Attribute - union(E.Identifier.Attribute) AND
R.Primary-Key.Attribute = R.Primary-Key.Attribute - union(E.Identifier.Attribute) AND
R.Foreign-Key.Attribute = R.Foreign-Key.Attribute - union(E.Identifier.Attribute) AND
A.relate.E(Max-Card) = m IF has_duplicate(ext(R.Primary-Key.Attribute - T.Primary-Key.Attribute)) = true AND
A.relate.E(Max-Card) = 1 IF has_duplicate(ext(R.Primary-Key.Attribute - T.Primary-Key.Attribute)) = false

5. Each non-foreign-key attribute of a relation in the relational model is equivalent to a single-valued attribute of the entity-class for the relation in the SOOER model. The name (and data type) of the single-valued attribute is the same as that of the non-foreign-key attribute. The null-specification of the single-valued attribute is determined by whether the non-foreign-key attribute allows null values. The uniqueness property of the single-valued attribute corresponds to the IsUnique property of the non-foreign-key attribute.

R: Relational.Relation AND N: R.Attribute - R.Foreign-Key.Attribute
E: SOOER.Entity-Class AND A: E.Attribute WITH
E = R AND A.Multiplicity = 'single-valued' AND A.Name = N.Name AND
A.Null-Spec = N.IsNull AND A.Uniqueness = N.IsUnique AND A.DataType = N.DataType

6. Each non-foreign-key attribute of a relation in the relational model corresponds to a single-valued attribute of the association relationship for the relation in the SOOER model. The name (and data type) of the single-valued attribute is the same as that of the non-foreign-key attribute. The null-specification of the single-valued attribute is determined by whether the non-foreign-key attribute allows null values. The uniqueness property of the single-valued attribute corresponds to the IsUnique property of the non-foreign-key attribute.
R: Relational.Relation AND N: R.Attribute - R.Foreign-Key.Attribute
E: SOOER.Association (WHERE E = R) AND A: E.Attribute WITH
A.Name = N.Name AND A.Multiplicity = 'single-valued' AND
A.Null-Spec = N.IsNull AND A.Uniqueness = N.IsUnique AND A.DataType = N.DataType

7. Each non-foreign-key primary key attribute of a relation in the relational model is equivalent to an identifier attribute of the entity-class for the relation in the SOOER model.

R: Relational.Relation AND K: R.Primary-Key.Attribute - R.Foreign-Key.Attribute
E: SOOER.Entity-Class WITH
E = R AND E.Identifier.Attribute = E.Identifier.Attribute ∪ {K}

8. Each foreign key which is not part of the primary key in the relational model is equivalent to an association relationship in the SOOER model which connects the entity-classes for the relation where the foreign key is defined and the relation referenced by the foreign key. Furthermore, if the extension of the foreign key contains duplicate tuples, the maximum cardinality on the entity-class for the relation where the foreign key is defined is m; otherwise it is 1. If the extension of the foreign key is the same as that of the primary key to which the foreign key refers, the minimum cardinality on the entity-class for the relation where the foreign key is defined is 1; otherwise, it is 0. The maximum cardinality on the entity-class for the relation referenced by the foreign key is 1. If the extension of the foreign key contains null values, the minimum cardinality on the entity-class for the relation referenced by the foreign key is 0; otherwise, it is 1.

R: Relational.Relation AND F: R.Foreign-Key (WHERE F.Attribute ∩ R.Primary-Key.Attribute = ∅) AND R2: F.Primary-Key.Relation
E1: SOOER.Entity-Class AND E2: SOOER.Entity-Class AND A: SOOER.Association WITH
E1 = R AND E2 = R2 AND A.relate.Entity-Class = {E1, E2} AND
A.relate.E1(Max-Card) = m IF has_duplicate(ext(F.Attribute)) = true AND
A.relate.E1(Max-Card) = 1 IF has_duplicate(ext(F.Attribute)) = false AND
A.relate.E1(Min-Card) = 1 IF ext(F.Attribute) = ext(F.Primary-Key.Attribute) AND
A.relate.E1(Min-Card) = 0 IF ext(F.Attribute) ≠ ext(F.Primary-Key.Attribute) AND
A.relate.E2(Max-Card) = 1 AND
A.relate.E2(Min-Card) = 0 IF has_null(ext(F.Attribute)) = true AND
A.relate.E2(Min-Card) = 1 IF has_null(ext(F.Attribute)) = false AND
R.Attribute = R.Attribute - R.Foreign-Key.Attribute
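The following sketch illustrates the flavor of rules 1 and 8 above: deciding that a relation maps to an entity-class because its primary key contains no foreign-key attributes, and deriving association cardinalities from the extension of a non-key foreign key with has_duplicate- and has_null-style tests. The Relation class, its fields, and the EMPLOYEE/DEPARTMENT example are assumptions introduced for this illustration, not the dissertation's notation, and the minimum-cardinality rule on E1 is simplified to the conservative case.

# Illustrative sketch of rules 1 and 8 above; the Relation class and the
# example data are assumptions made for this sketch.
from dataclasses import dataclass, field

@dataclass
class Relation:
    name: str
    attributes: list
    primary_key: list                                   # subset of attributes
    foreign_keys: dict = field(default_factory=dict)    # FK attribute -> referenced relation
    rows: list = field(default_factory=list)            # extension, one dict per tuple

def is_entity_class(rel):
    """Rule 1: the primary key contains no foreign-key attributes."""
    return not any(a in rel.foreign_keys for a in rel.primary_key)

def association_cardinalities(rel, fk_attrs):
    """Rule 8 (foreign key outside the primary key): derive (min, max) cardinalities
    for the entity-class of rel (E1) and the referenced entity-class (E2) from ext(FK)."""
    values = [tuple(row[a] for a in fk_attrs) for row in rel.rows]
    non_null = [v for v in values if None not in v]
    e1_max = "m" if len(non_null) != len(set(non_null)) else 1   # has_duplicate(ext(F))
    e1_min = 0   # simplified: 1 only when ext(F) equals the referenced primary key's extension
    e2_min = 0 if len(non_null) != len(values) else 1            # has_null(ext(F))
    return {"E1": (e1_min, e1_max), "E2": (e2_min, 1)}

# Hypothetical example: EMPLOYEE(SSN, DNO) with DNO a foreign key to DEPARTMENT.
emp = Relation(name="EMPLOYEE", attributes=["SSN", "DNO"], primary_key=["SSN"],
               foreign_keys={"DNO": "DEPARTMENT"},
               rows=[{"SSN": "1", "DNO": "D1"}, {"SSN": "2", "DNO": "D1"}])
print(is_entity_class(emp))                      # True: SSN is not a foreign-key attribute
print(association_cardinalities(emp, ["DNO"]))   # {'E1': (0, 'm'), 'E2': (1, 1)}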
APPENDIX H
Intra-Model Construct Equivalences of the SOOER Model

1. A multi-valued attribute in the SOOER model is equivalent to an entity-class, a single-valued attribute, and an association relationship in the SOOER model. The association relationship connects the new entity-class and the existing entity-class to which the multi-valued attribute belongs. The multi-valued attribute is removed from its entity-class. The single-valued attribute becomes the identifier of the new entity-class; as the identifier of the new entity-class, its uniqueness and null-spec properties are 'unique' and 'not-null', respectively. The name and data type of the single-valued attribute are the same as those of the multi-valued attribute. If different instances of the existing entity-class can share the same value of the multi-valued attribute (i.e., the uniqueness property of the multi-valued attribute is 'not-unique'), the maximum cardinality on the existing entity-class is m; otherwise, it is 1. The minimum cardinality on the existing entity-class is always 1, since each value of the multi-valued attribute was associated with an instance of the existing entity-class. Because it results from transforming a multi-valued attribute, the maximum cardinality on the new entity-class is always m. If the null-spec property of the multi-valued attribute is 'null', an instance of the existing entity-class may not have any value for this attribute, so the minimum cardinality on the new entity-class is 0; if the null-spec property of the multi-valued attribute is 'not-null', the minimum cardinality on the new entity-class is 1.

E: Entity-Class AND M: E.Attribute (WHERE M.Multiplicity = 'multi-valued')
N: Entity-Class AND A: Attribute AND R: Association WITH
A.Multiplicity = 'single-valued' AND R.relate.Entity-Class = {E, N} AND
N.Attribute = {A} AND N.Identifier.Attribute = N.Attribute AND
A.Name = M.Name AND A.DataType = M.DataType AND
A.Uniqueness = 'unique' AND A.Null-Spec = 'not-null' AND
R.relate.E(Max-Card) = m IF M.Uniqueness = 'not-unique' AND
R.relate.E(Max-Card) = 1 IF M.Uniqueness = 'unique' AND
R.relate.E(Min-Card) = 1 AND
R.relate.N(Max-Card) = m AND
R.relate.N(Min-Card) = 0 IF M.Null-Spec = 'null' AND
R.relate.N(Min-Card) = 1 IF M.Null-Spec = 'not-null'

2. A set of entity-classes having the same identifier as another entity-class, whose extension on its identifier attributes is a superset of the union of the extensions of the former, is equivalent to a new specialization relationship in which the latter entity-class is the superclass and the former entity-classes are the subclasses. Furthermore, the identifier attributes of each subclass are removed.

C: Entity-Class AND
T: {E: Entity-Class (WHERE E.Identifier.Attribute = C.Identifier.Attribute)} (WHERE union(ext(E.Identifier.Attribute)) ⊆ ext(C.Identifier.Attribute))
S: Specialization WITH
S.superclass.Entity-Class = {C} AND S.subclass.Entity-Class = T AND
E.Identifier = E.Identifier - C.Identifier AND
E.Attribute = E.Attribute - C.Identifier.Attribute

3. If a set of entity-classes shares the same identifier as another entity-class and the extension of the latter is a superset of the union of the extensions of the former, the set of association relationships between the latter entity-class and each of the former is equivalent to a new specialization relationship with the latter as the superclass and the former as the subclasses. Furthermore, the identifier attributes of each subclass are removed.

C: Entity-Class AND
T: {E: Entity-Class (WHERE E.Identifier.Attribute = C.Identifier.Attribute)} (WHERE union(ext(E.Identifier.Attribute)) ⊆ ext(C.Identifier.Attribute))
R: {A: E.relate.Association (WHERE A.relate.Entity-Class = {C, E})}
S: Specialization WITH
S.superclass.Entity-Class = {C} AND S.subclass.Entity-Class = T AND
E.Identifier = E.Identifier - C.Identifier AND
E.Attribute = E.Attribute - C.Identifier.Attribute

4. A set of entity-classes sharing the same identifier is equivalent to a new specialization with a new entity-class as the superclass and the set of entity-classes as the subclasses. The identifier and the common attributes of the subclasses are promoted to the new superclass.

T: {E: Entity-Class} (WHERE equal(E.Identifier.Attribute) = true)
N: Entity-Class AND S: Specialization WITH
S.superclass.Entity-Class = {N} AND S.subclass.Entity-Class = T AND
N.Attribute = intersect(E.Attribute) AND N.Identifier.Attribute = E.Identifier.Attribute AND
E.Attribute = E.Attribute - intersect(E.Attribute)
5. A set of specialization relationships, each of which has a single subclass and shares the same superclass, is equivalent to a new specialization relationship with that superclass as the superclass and the set of subclasses of those specialization relationships as the subclasses of the new specialization relationship.

E: Entity-Class AND
S: {T: Specialization (WHERE count(T.subclass.Entity-Class) = 1 AND T.superclass.Entity-Class = E)} AND
B: {C: T.subclass.Entity-Class}
N: Specialization WITH
N.superclass.Entity-Class = {E} AND N.subclass.Entity-Class = B

6. If an entity-class E is not associated with any aggregation relationship and has maximum and minimum cardinality 1 on every entity-class C directly linked to E via an association relationship, the entity-class E and the set of association relationships are equivalent to a new association relationship connecting all of the Cs. The minimum and maximum cardinality on each C with the new association relationship are the same as those on E with the association relationship to C. All attributes of E become attributes of the new association relationship.

E: Entity-Class (WHERE E.aggregate.Aggregation = ∅ AND E.component.Aggregation = ∅) AND
S: {A: E.relate.Association (WHERE count(A.relate.Entity-Class) = 2)} AND
O: {C: A.relate.Entity-Class (WHERE C ≠ E AND A.relate.C(min-card) = 1 AND A.relate.C(max-card) = 1)} (WHERE count(O) = count(S))
R: Association WITH
R.relate.Entity-Class = O AND
R.relate.C(min-card) = A.relate.E(min-card) AND
R.relate.C(max-card) = A.relate.E(max-card) AND
R.Attribute = E.Attribute

7. An aggregation relationship is equivalent to a set of association relationships, each of which connects a component-class to the aggregate-class of the aggregation relationship. The maximum and minimum cardinality on each component-class with its new association relationship are the same as those on the component-class with the aggregation relationship. The maximum and minimum cardinality on the aggregate-class with each new association relationship are both 1.

A: Aggregation AND G: A.aggregate.Entity-Class AND E: {C: A.component.Entity-Class}
S: {R: Association} WITH
R.relate.Entity-Class = {G, C} AND
R.relate.C(min-card) = A.component.C(min-card) AND
R.relate.C(max-card) = A.component.C(max-card) AND
R.relate.G(min-card) = 1 AND
R.relate.G(max-card) = 1
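As an illustration of equivalence 7 above, the following sketch expands an aggregation relationship into one association per component-class, copying each component's cardinalities and fixing the aggregate-class side at (1, 1). The dictionary layout and the example values are assumptions introduced only for this sketch; the cardinalities echo the induced CONSIST_OF_1 relationship-class shown earlier.

# Illustrative sketch of equivalence 7 above: an aggregation relationship is
# replaced by one association per component-class. The dictionary layout and
# the example values are assumptions introduced only for this sketch.

def aggregation_to_associations(aggregation):
    """Expand an aggregation into associations, copying each component's
    cardinalities and fixing the aggregate-class side at (1, 1)."""
    associations = []
    for component, (min_card, max_card) in aggregation["components"].items():
        associations.append({
            "relates": [aggregation["aggregate"], component],
            "cardinality": {
                component: (min_card, max_card),        # copied from the aggregation
                aggregation["aggregate"]: (1, 1),       # aggregate side is always (1, 1)
            },
        })
    return associations

# Example values echoing the induced CONSIST_OF_1 relationship-class shown earlier.
consist_of_1 = {"aggregate": "RECORD",
                "components": {"CONSTRUCT_1": (0, "m"), "KEY": (0, "m"),
                               "PARENT": (0, 1), "POINTER": (0, 1)}}
for assoc in aggregation_to_associations(consist_of_1):
    print(assoc)

Each generated association keeps the component's (min, max) pair from the aggregation, which mirrors the cardinality bounds stated for CONSIST_OF_1 above.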