Matching dependencies with arbitrary attribute values

Leopoldo Bertossi

2011, Proceedings of the 4th International Workshop on Logic in Databases - LID '11

Matching dependencies (MDs) are used to declaratively specify the identification (or matching) of certain attribute values in pairs of database tuples when some similarity conditions are satisfied. Their enforcement can be seen as a natural generalization of entity resolution. In what we call the pure case of MDs, any value from the underlying data domain can be used for the value in common that does the matching. We investigate the semantics and properties of data cleaning through the enforcement of matching dependencies for the pure case. We characterize the intended clean instances and also the clean answers to queries as those that are invariant under the cleaning process. The complexity of computing clean instances and clean answers to queries is investigated. Tractable and intractable cases depending on the MDs are characterized.

Matching Dependencies with Arbitrary Attribute Values: Semantics, Query Answering and Integrity Constraints ∗

Jaffer Gardezi, University of Ottawa, SITE, Ottawa, Canada
Leopoldo Bertossi †, Carleton University, SCS, Ottawa, Canada
Iluju Kiringa, University of Ottawa, SITE, Ottawa, Canada

∗ Research supported by the NSERC Strategic Network on Business Intelligence (BIN, ADC05) and NSERC/IBM CRDPJ/371084-2008.
† Faculty Fellow of the IBM CAS.

LID 2011, March 25, 2011, Uppsala, Sweden. Copyright 2011 ACM 978-1-4503-0609-6.

1. INTRODUCTION

A database instance can be seen as a model of an external reality. As such, it may contain several tuples and values in them that refer to the same external entity. In consequence, the database may be modeling the same entity in different forms, as different entities, which most likely is not the intended representation. This problem could be caused by errors in data, by data coming from different sources that use different formats or semantics, etc. In this case, the database is considered to contain dirty data, and it must undergo a cleansing process that goes through two interlinked phases: detecting tuples (or values therein) that should be matched or identified, and, of course, doing the actual matching. This problem is usually called entity resolution, data fusion, duplicate record detection, etc. Cf. [9, 7] for some recent surveys and [2] for more recent work in the area.

Quite recently, and generalizing entity resolution, [10, 11] introduced matching dependencies (MDs), which are declarative specifications of matchings of attribute values that should hold under certain conditions. MDs help identify duplicate data and enforce their merging by exploiting semantic knowledge. Loosely speaking, an MD is a rule defined on a database which states that, for any pair of tuples from given relations within the database, if the values of certain attributes of the tuples are similar, then the values of another set of attributes should be considered to represent the same object. In consequence, they should take the same values. Here, similarity of values can mean equality or a domain-dependent similarity relationship, e.g. related to some metric, such as the edit distance.

Example 1. Consider the following database instance of a relation P.

Name         Phone            Address
John Smith   723-9583         10-43 Oak St.
J. Smith     (750) 723-9583   43 Oak St. Ap. 10

Similarity of the names in the two tuples (as measured by, e.g. edit distance) is insufficient to establish that the tuples refer to the same person.
This is because the last name is a common one, and only the first initial of one of the names is given. However, similarity of their phone and address values indicates that the two tuples may be duplicates. This is expressed by an MD which states that, if two tuples from P have similar address and phone, then the names should match. In the notation of MDs:

P[Phone] ≈ P[Phone] ∧ P[Address] ≈ P[Address] → P[Name] ⇋ P[Name]. ✷

The identification in [10, 11] of this new class of dependencies and their declarative formulation have become important additions to data cleaning research. In this work we investigate matching dependencies, starting by introducing our own refinement of the model-theoretic and dynamic semantics of MDs introduced in [11].

Any method of querying a dirty data source must address the issue of duplicate detection in order to obtain accurate answers. Typically, this is done by first cleaning the data by discarding or combining duplicate tuples and standardizing formats. The result will be a new database where the entity conflicts have been resolved. However, the entity resolution problem may have different solution instances (which we will simply call solutions), i.e. different clean versions of the original database. The model-theoretic semantics that we propose and investigate defines and characterizes the class of solutions, i.e. of intended clean instances.

After a clean instance has been obtained, it can be queried as usual. However, the query answers will then depend on the particular solution at hand. So, it becomes relevant to characterize those query answers that are invariant under the different (sensible) ways of cleaning the data, i.e. that persist across the solutions. This is an interesting problem per se. However, it becomes crucial if one wants to obtain semantically clean answers while still querying the original dirty data source.
This kind of virtual cleaning (and query answering on top of it) has been investigated in the area of consistent query answering (CQA) [1], where, instead of MDs, classical integrity constraints (ICs) are considered, and database instances are repaired in order to restore consistency (cf. [5, 3, 8] for surveys of CQA). Virtual approaches to robust query answering under entity resolution and enforcement of matching dependencies are certainly unavoidable in virtual data integration systems.

In this paper we make the following contributions, among others:

1. We revisit the semantics of MDs introduced in [11], pointing out sensible and justified modifications of it. A new semantics for MD satisfaction is then proposed and formally developed.

2. Using the new MD semantics, we formally define the intended solutions for a given, initial instance, D0, that may not satisfy a given set of MDs. They are called minimally resolved instances (MRIs), and are obtained through an iteration process that stepwise enforces the MDs until a stable instance is reached. The resulting instances minimally differ from D0 wrt the number of changes of attribute values. This semantics (and the whole paper) considers the pure case introduced in [11], in the sense that the values that can be chosen to match attribute values are arbitrarily taken from the underlying data domains. No matching functions are considered, like in [2], where entire tuples are merged, not individual attribute values; or [6], where MDs with matching functions are investigated.

3. We introduce the notion of resolved answers to a query posed to D0. They are the answers that are invariant under the MRIs.

4. We investigate the computability and complexity of computing MRIs and resolved answers, identifying cases where computing (actually, deciding) resolved answers is intractable.

This paper is organized as follows. Section 2 presents basic concepts and notations needed in the rest of the paper.
Section 3 identifies some problems with the MD semantics, and refines it to address them. It also introduces the resolved instances and resolved answers to a query. Section 4 considers the problems of computing resolved instances and resolved query answers. Section 5 presents some final conclusions. Some proofs of results can be found in the appendix.

2. PRELIMINARIES

In general terms, we consider a relational schema S that includes an enumerable infinite domain U. An instance D of S can be seen as a finite set of ground atoms of the form R(t̄), where R is a database predicate in S, and t̄ is a tuple of constants from U. We assume that each database tuple has an identifier, e.g. an extra attribute that acts as a key for the relation and is not subject to updates. In the following it will not be listed, unless necessary, as one of the attributes of a database predicate. It plays an auxiliary role only, to keep track of updates on the other attributes.

R(D) denotes the extension of R in D. We sometimes refer to attribute A of R by R[A]. If the ith attribute of predicate R is A, for a tuple t = (c1, ..., cj) ∈ R(D), t[A] denotes the value ci. The symbol t[Ā] denotes the vector whose entries are the values of the attributes in the vector Ā. The attributes may have infinite subdomains that are contained in U. Constants will be denoted by lower case letters at the beginning of the alphabet.

A matching dependency [10], involving predicates R(A1, ..., An), S(B1, ..., Bm), is a rule of the form

\[ \bigwedge_{i \in I,\, j \in J} R[A_i] \approx_{ij} S[B_j] \;\rightarrow\; \bigwedge_{i \in I',\, j \in J'} R[A_i] \rightleftharpoons S[B_j]. \qquad (1) \]

Here R and S could be the same predicate. I, I′ and J, J′ are fixed subsets of {1, ..., n} and {1, ..., m}, resp. We assume that, when Ai, Bj are related via ≈_ij or ⇋ in (1), they share the same (sub)domain, so their values can be compared by the domain-dependent binary similarity predicate ≈_ij, or can be identified, resp.
In this paper, we will assume that there is at most one similarity operator defined on the domain of any given attribute. The similarity operators, generically denoted with ≈, are assumed to have the properties of:

(a) Symmetry: If x ≈ y, then y ≈ x.
(b) Equality subsumption: If x = y, then x ≈ y.

The MD in (1) is implicitly universally quantified in front, and applied to pairs of tuples t1, t2 for R and S, resp. There are two complementary ways of interpreting this MD: a static interpretation and a dynamic one. The expression ⋀ R[Ai] ≈_ij S[Bj] states that the values of the attributes Ai in tuple t1 are similar to those of attributes Bj in tuple t2. In the static interpretation, the MD is read as an implication, similar to a functional dependency (FD). It says that, if ⋀ R[Ai] ≈_ij S[Bj] holds, then for each pair Ai and Bj such that R[Ai] ⇋ S[Bj] appears on the RHS and for the same tuples t1 and t2, t1[Ai] and t2[Bj] are equal. The dynamic interpretation of the MD states that if this similarity condition holds, such pairs of attributes should be updated so that they become the same for t1 and t2. However, the attribute values to be used for this matching are left unspecified by (1). The static interpretation is useful for identifying dirty data, while the dynamic interpretation specifies a procedure for cleaning the data.

For abbreviation, we will sometimes write MDs as

\[ R[\bar{A}] \approx S[\bar{B}] \;\rightarrow\; R[\bar{C}] \rightleftharpoons S[\bar{E}], \qquad (2) \]

where Ā, B̄, C̄, and Ē represent the lists of attributes (A1, ..., Ak), (B1, ..., Bk), (C1, ..., Ck′), and (E1, ..., Ek′), respectively. We refer to the pairs of attributes (Ai, Bi) and (Ci, Ei) as corresponding pairs of attributes of the pairs (Ā, B̄) and (C̄, Ē), respectively. For an instance D and a pair of tuples t1 ∈ R(D) and t2 ∈ S(D), t1[Ā] ≈ t2[B̄] indicates that the similarities of the values for all corresponding pairs of attributes of (Ā, B̄) hold.
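As an illustration of the static reading, the following Python sketch lists the pairs of tuples that violate an MD. It is only a sketch under assumptions that are not part of the formal development: tuples are represented as dictionaries, an MD as two lists of corresponding attribute pairs, and ≈ is instantiated by "edit distance at most 1". The relation `R` and its values are hypothetical.

```python
from itertools import product


def edit_distance(s, t):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or keep
        prev = cur
    return prev[-1]


def similar(x, y):
    # Assumed instance of the similarity operator: it is symmetric and
    # subsumes equality, as required of every similarity operator.
    return edit_distance(x, y) <= 1


def lhs_holds(t1, t2, lhs):
    """True if all corresponding LHS attribute pairs have similar values."""
    return all(similar(t1[a], t2[b]) for a, b in lhs)


def static_violations(r, s, lhs, rhs):
    """Pairs of tuple ids whose LHS similarities hold but whose RHS values differ."""
    return [(i1, i2)
            for (i1, t1), (i2, t2) in product(r.items(), s.items())
            if lhs_holds(t1, t2, lhs) and any(t1[c] != t2[e] for c, e in rhs)]


# Hypothetical instance and the MD R[A] ≈ R[A] → R[B] ⇋ R[B].
R = {"t1": {"A": "smith", "B": "x"},
     "t2": {"A": "smyth", "B": "y"},
     "t3": {"A": "jones", "B": "z"}}
violations = static_violations(R, R, lhs=[("A", "A")], rhs=[("B", "B")])
print(violations)
```

Only t1 and t2 have similar A values, so they are the only pair reported (in both orders). The sketch detects dirty data only; it does not choose the common value that the dynamic interpretation would require.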
Similarly, t1[C̄] = t2[Ē] denotes the equality of the values of all pairs of corresponding attributes of (C̄, Ē).

In the dynamic interpretation, an MD involves an update operation. This leads to a concept of satisfaction of an MD by a pair of database instances: an instance D and its updated instance D′.

Definition 1. [11] Let D, D′ be instances of schema S with predicates R and S, such that, for each tuple t in D, there is a unique tuple t′ in D′ with the same identifier as t, and vice versa. The pair (D, D′) satisfies the MD m in (2), denoted (D, D′) ⊨ m, iff, for every pair of tuples tR ∈ R(D) and tS ∈ S(D), if tR and tS satisfy tR[Ā] ≈ tS[B̄], then for the corresponding tuples t′R and t′S in R(D′), S(D′), resp., it holds: (a) t′R[C̄] = t′S[Ē], and (b) t′R[Ā] ≈ t′S[B̄]. ✷

Intuitively, D′ in Definition 1 is an instance obtained from D by enforcing m on instance D. For a set M of MDs, and a pair of instances (D, D′), (D, D′) ⊨ M means that (D, D′) ⊨ m, for every m ∈ M. An instance D′ is stable [11] for a set M of MDs if (D′, D′) ⊨ M. Stability of an instance is a static concept analogous to satisfaction by the instance D′ of a set of FDs. Stable instances correspond to the intuitive notion of a clean database, in the sense that all the expected value identifications already take place in it. Although not explicitly developed in [11], for an instance D, if (D, D′) ⊨ M for a stable instance D′, then D′ is expected to be reachable as a fix-point of an iteration of value identification updates that starts from D and is based on M.

3. MD SEMANTICS REVISITED

Condition (b) in Definition 1 is used to prevent the identification updates from destroying the original similarities. Unfortunately, enforcing this strong requirement sometimes leads to counterintuitive results.

Example 2.
Consider the following instance D with string-valued attributes, and MDs:

R(D)   A   B   C
       a   c   g
       a   c   ksp

S(D)   E     F
       h     c
       msp   c

R[A] ≈ R[A] → R[C] ⇋ R[C]   (3)
R[C] ≈ S[E] → R[B] ⇋ S[F]   (4)

For two strings s1 and s2, s1 ≈ s2 if the edit distance d between s1 and s2 satisfies d ≤ 1. To produce an instance D′ satisfying (D, D′) ⊨ M, the strings g and ksp must be changed to some common string s′. Because of the similarities h ≈ g and ksp ≈ msp, s′ must be similar to the E attribute values of the tuples in S, by condition (b) of Definition 1 and MD (4). Clearly, there is no s′ that is similar to both h and msp. Therefore, at least one of h and msp must be modified to some new value in D′. ✷

Another problem with the semantics of MDs is that it allows duplicate resolution in instances that are already resolved. Intuitively, there is no reason to change the values in an instance that is stable for a set of MDs M, because there is no reason to believe, on the basis of M, that these values are in error. However, even if an instance D satisfies (D, D) ⊨ M, it is always possible, by choosing different common values, to produce a different instance D′ such that (D, D′) ⊨ M. This is illustrated in the next example.

Example 3. Let D be the instance below and the MD R[A] ≈ R[A] → R[B] ⇋ R[B].

R(D)   A   B
       a   c
       a   c

Although D is stable, (D, D′) ⊨ m is true for any D′ where the B attribute values of the two tuples are the same. ✷

3.1 MD satisfaction

We now propose a new semantics for MD satisfaction that disallows unjustified attribute modifications. We keep condition (a) of Definition 1, while replacing condition (b) with a restriction on the possible updates that can be made.

Definition 2. Let D be an instance of schema S, R ∈ S, tR ∈ R(D), C an attribute of R, and M a set of MDs. Value tR[C] is modifiable if there exist S ∈ S, tS ∈ S(D), an m ∈ M of the form R[Ā] ≈ S[B̄] → R[C̄] ⇋ S[Ē], and a corresponding pair (C, E) of (C̄, Ē), such that one of the following holds:
1.
tR[Ā] ≈ tS[B̄], but tR[C] ≠ tS[E].
2. tR[Ā] ≈ tS[B̄] and tS[E] is modifiable. ✷

Example 4. Consider an instance D with two relations R and S with three MDs defined on it:

R(D)        A    B
      t0    a0   b
      t1    a1   b
      t2    a2   b

S(D)        C    E
      t3    a3   c
      t4    a4   c
      t5    a5   c

m1 : R[A] ≈ R[A] → R[B] ⇋ R[B],
m2 : R[A] ≈ S[C] → R[B] ⇋ S[E],
m3 : S[C] ≈ S[C] → S[E] ⇋ S[E].

The following similarities hold on the distinct constants of R and S: a_i ≈ a_((i+1) mod 6), 0 ≤ i ≤ 5. The values t2[B] and t3[E] are modifiable by condition 1 of Definition 2, m2, a2 ≈ a3, and t2[B] ≠ t3[E]. For the same reason, t0[B] and t5[E] are modifiable. Value t1[B] is modifiable by condition 2 of Definition 2, m1, a1 ≈ a2, and the fact that t2[B] is modifiable. Similarly, t4[E] is modifiable. ✷

Definition 3. Let D, D′ be instances for S with the same tuple ids, and M a set of MDs. (D, D′) satisfies M, denoted (D, D′) ⊨ M, iff:
1. For any pair of tuples tR ∈ R(D), tS ∈ S(D), if there exists an MD in M of the form R[Ā] ≈ S[B̄] → R[C̄] ⇋ S[Ē] and tR[Ā] ≈ tS[B̄], then for the corresponding tuples t′R ∈ R(D′) and t′S ∈ S(D′), it holds t′R[C̄] = t′S[Ē].
2. For any tuple tR ∈ R(D) and any attribute G of R, if tR[G] is not modifiable, then t′R[G] = tR[G]. ✷

Condition 2 captures a natural default condition of persistence of values: only those that have to be changed are changed. As before, we define stable instance for M to mean (D, D) ⊨ M. Except where otherwise noted, these are the notions of satisfaction and stability that will be used in the rest of this paper.

Example 5. Consider again Example 4. The set of all D′ such that (D, D′) ⊨ M is the set of all instances obtained from D by changing all values of R[B] and S[E] to a common value, and leaving all other values unchanged. This is because the values of R[B] and S[E] are the only modifiable values, and these values must be equal by condition 1 of Definition 3 and the given similarities. ✷

Condition 2 in Definition 3 on the set of updatable values does not prevent us from obtaining instances D′ that enforce the MD, as the following theorem establishes.

Theorem 1. For any instance D and set of MDs M, there exists a D′ such that (D, D′) ⊨ M. Moreover, for any attribute value that is changed from D to D′, the new value can be chosen arbitrarily, as long as it is consistent with (D, D′) ⊨ M. ✷

The new semantics introduced in Definition 3 solves the problems mentioned at the beginning of this section. Notice that it does not require additional changes to preserve similarities (if the original ones were broken). Furthermore, modifications of instances, unless required by the enforcement of matchings as specified by the MDs, are not allowed. Also notice that the instance D′ in Theorem 1 is not guaranteed to be stable. We address this issue in the next section.

Moreover, as can be seen from the proof of Theorem 1, the new restriction imposed by Definition 3 is as strong as possible in the following sense: Any definition of MD satisfaction that includes condition 1 must allow the modification of the modifiable attributes (according to Definition 2). Otherwise, it is not possible to ensure, for arbitrary D, the existence of an instance D′ with (D, D′) ⊨ M.

3.2 Resolved instances

According to the MD semantics in [11], although not explicitly stated there, a clean version D′ of an instance D is an instance D′ satisfying the conditions (D, D′) ⊨ M and (D′, D′) ⊨ M. Due to the natural restrictions on updates captured by the new semantics (cf. Definition 3), the existence of such a D′ is not guaranteed. Essentially, this is because D′ is the result of a series of updates. The MDs are applied to the original instance D to produce a new instance, which may have new pairs of similar values, forcing another application of the MDs, which in turn produces another instance, and so on, until a stable instance D′ is reached. The pair (D, D′) may not satisfy M. However, we will be interested in those instances D′ just mentioned. The idea is to relax the condition (D, D′) ⊨ M, and obtain a stable D′ after an iterative process of MD enforcement, which at each step, say k, makes sure that (Dk−1, Dk) ⊨ M.

Definition 4. Let D be a database instance and M a set of MDs. A resolved instance for D wrt M is an instance D′, such that there is a finite (possibly empty) sequence of instances D1, D2, ..., Dn with: (D, D1) ⊨ M, (D1, D2) ⊨ M, ..., (Dn−1, Dn) ⊨ M, (Dn, D′) ⊨ M, and (D′, D′) ⊨ M. ✷

Notice that, by Definition 3, for an instance D satisfying (D, D) ⊨ M, it holds (D, D′) ⊨ M if and only if D′ = D. In this case, the only possible set of intermediate instances is the empty set and D is the only resolved instance. Thus, a resolved instance cannot be obtained by making changes to an instance that is already resolved.

Theorem 2. Given an instance D and a set M of MDs, there always exists a resolved instance for D with respect to M. ✷

Example 6. Consider the following instance D of a relation R and set M of MDs:

R(D)   A   B   C
       a   b   d
       a   c   e
       a   b   e

R[A] ≈ R[A] → R[B] ⇋ R[B],
R[B] ≈ R[B] → R[C] ⇋ R[C].

All pairs of distinct constants in R are dissimilar. Two resolved instances D1 and D2 of R are shown.

R(D1)   A   B   C
        a   b   d
        a   b   d
        a   b   d

R(D2)   A   B   C
        a   b   e
        a   b   e
        a   b   e

Notice that (D, D1) ⊭ M, because the value of the C attribute of the second tuple is not modifiable in D. This shows that some resolved instances cannot be obtained in a single update step, with updated instances as in Definition 4. ✷

The notion of resolved instance is one step towards the characterization of the intended clean instances. However, it still leaves room for refinement. Actually, the resolved instances that are of most interest for us are those that are somehow closest to the original instance.
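Definitions 2 and 3 and the claims of Example 6 can be checked mechanically. The following Python sketch is an illustration only, under assumptions made here for the purpose: a single relation stored as a dictionary from tuple ids to attribute/value maps, MDs given as lists of corresponding attribute pairs, modifiability computed as a least fixpoint and applied symmetrically to both tuples of a matched pair, and ≈ taken to be equality (in Example 6 distinct constants are dissimilar). The intermediate instance `D_mid` is one possible first update step chosen for the illustration.

```python
def lhs_holds(t1, t2, lhs, sim):
    """True if all corresponding LHS attribute pairs of t1, t2 are similar."""
    return all(sim(t1[a], t2[b]) for a, b in lhs)


def matched_positions(D, M, sim):
    """Position pairs ((id1, C), (id2, E)) that M requires to be matched in D."""
    return [((i1, c), (i2, e))
            for lhs, rhs in M
            for i1, t1 in D.items()
            for i2, t2 in D.items()
            if lhs_holds(t1, t2, lhs, sim)
            for c, e in rhs]


def modifiable(D, M, sim):
    """Modifiable positions of D (Definition 2), computed as a least fixpoint."""
    pairs = matched_positions(D, M, sim)
    mod = set()
    for p, q in pairs:                      # condition 1: values differ
        if D[p[0]][p[1]] != D[q[0]][q[1]]:
            mod |= {p, q}
    changed = True
    while changed:                          # condition 2: propagate
        changed = False
        for p, q in pairs:
            if (p in mod) != (q in mod):
                mod |= {p, q}
                changed = True
    return mod


def satisfies(D, D_new, M, sim):
    """Check (D, D_new) |= M in the sense of Definition 3."""
    cond1 = all(D_new[p[0]][p[1]] == D_new[q[0]][q[1]]
                for p, q in matched_positions(D, M, sim))
    mod = modifiable(D, M, sim)
    cond2 = all(D_new[i][a] == v or (i, a) in mod
                for i, t in D.items() for a, v in t.items())
    return cond1 and cond2


def equal(x, y):
    return x == y


def inst(rows):
    """Build an instance of R(A, B, C) with tuple ids 1, 2, 3, ..."""
    return {i: dict(zip("ABC", row)) for i, row in enumerate(rows, 1)}


# The two MDs of Example 6.
M = [([("A", "A")], [("B", "B")]),
     ([("B", "B")], [("C", "C")])]

D = inst(["abd", "ace", "abe"])
D_mid = inst(["abd", "abe", "abd"])   # assumed intermediate step
D1 = inst(["abd", "abd", "abd"])
D2 = inst(["abe", "abe", "abe"])

print(satisfies(D, D1, M, equal))                                     # False
print(satisfies(D, D2, M, equal) and satisfies(D2, D2, M, equal))     # True
print(satisfies(D, D_mid, M, equal) and satisfies(D_mid, D1, M, equal)
      and satisfies(D1, D1, M, equal))                                # True
```

Under these assumptions, D2 is reached in a single step, whereas D1 needs the intermediate instance: the C value of the second tuple becomes modifiable only after the B values have been identified.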
This consideration leads to the concept of minimal resolved instance, which uses as a measure of change the number of values that were modified to obtain the clean database. In Example 6, instance D2 is a minimal resolved instance, whereas D1 is not.

Definition 5. Let D be an instance.
(a) T_D := {(t, A) | t is the id of a tuple in D and A is an attribute of the tuple}.
(b) f_D : T_D → U is given by: f_D(t, A) := the value for A in the tuple in D with id t.
(c) For an instance D′ with the same tuple ids as D:

\[ S_{D,D'} := \{ (t, A) \in T_D \mid f_D(t, A) \neq f_{D'}(t, A) \}. \]
✷

Intuitively, S_{D,D′} is the set of all "positions" within the instance such that the value at that position was changed in going from D to D′.

Definition 6. Let D be an instance and M a set of MDs. A minimally resolved instance (MRI) of D wrt M is a resolved instance D′ such that |S_{D,D′}| is minimum, i.e. there is no resolved instance D′′ with |S_{D,D′′}| < |S_{D,D′}|. We denote by Res(D, M) the set of minimal resolved instances of D wrt the set M of MDs. ✷

Example 7. Consider the instance below and the MD R[A] ≈ S[C] → R[B] ⇋ S[D].

R   A    B          S   C    D
    a1   b1             c1   d1

Assuming that a1 ≈ c1, this instance has two minimal resolved instances, namely

R   A    B          S   C    D
    a1   d1             c1   d1

and

R   A    B          S   C    D
    a1   b1             c1   b1
✷

Considering that MDs concentrate on changes of attribute values, we consider that this notion of minimality is appropriate. The comparisons have to be made at the attribute value level. In CQA a few other notions of minimality and comparison of instances have been investigated [3].

The requirement of Definition 6 that the number of changes be minimized in an MRI can be relaxed to allow MRIs whose change is within some percentage of the minimum without affecting any of the results presented here. This might be a more appropriate definition in certain duplicate resolution settings.

In this subsection, we defined the MRIs, which we use as our model of a clean database instance.
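Definition 5 translates directly into a set comprehension. The following minimal Python sketch assumes that instances are dictionaries from tuple ids to attribute/value maps (a representation chosen here for illustration) and compares the two resolved instances D1 and D2 of Example 6 by the size of their change sets.

```python
def change_set(D, D_new):
    """S_{D,D'}: positions (tuple id, attribute) whose values differ in D and D_new."""
    return {(i, a) for i, t in D.items() for a, v in t.items() if D_new[i][a] != v}


def inst(rows):
    """Build an instance of R(A, B, C) with tuple ids 1, 2, 3, ..."""
    return {i: dict(zip("ABC", row)) for i, row in enumerate(rows, 1)}


D = inst(["abd", "ace", "abe"])    # instance D of Example 6
D1 = inst(["abd", "abd", "abd"])   # resolved instance D1
D2 = inst(["abe", "abe", "abe"])   # resolved instance D2

sizes = {name: len(change_set(D, X)) for name, X in {"D1": D1, "D2": D2}.items()}
best = min(sizes, key=sizes.get)
print(sizes, best)
```

D1 changes three positions and D2 only two, in line with the remark that D2 is minimal whereas D1 is not. The sketch only ranks the candidates it is given; establishing minimality in general requires considering all resolved instances.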
This leads to the definition of resolved answers to a query in the next subsection. Intuitively, these are the answers that are true in all MRIs. This is analogous to CQA, where a consistent answer to a query is defined as being true in (obtained from) all minimal repairs of a database that violates a set of integrity constraints [1]. Indeed, instead of applying the dynamic semantics of MDs to this context, we could have taken a more traditional approach, in which the MDs are interpreted as integrity constraints and the consistent answers are computed relative to these constraints. However, such an approach is not appropriate in this context, as the next example shows.

Example 8. Consider the following instance of a relation R and MD:

R   A   B
    a   b
    a   d

R[A] = R[A] → R[B] ⇋ R[B]

where all pairs of distinct constants in R are unequal. Suppose we viewed the MDs as functional dependencies to be satisfied by R, replacing ⇋ with =. Consider the repairs (in the sense of CQA) that would be obtained via attribute modification [13, 14, 4, 12, 3], and minimality as in Definition 6. It follows immediately from Definition 6 that all MRIs are repairs. In the context of duplicate resolution, the appropriate way to repair R would be to set the values in the B column to a common value. However, one way of repairing the instance would be to change one of the values in the A column to b. In failing to restrict the allowed updates, an approach that defines MDs as traditional integrity constraints will lead to undesirable repairs and will not provide an appropriate semantics for the certain answers. ✷

3.3 Resolved answers

Let Q(x̄) be a query expressed in the first-order language L(S) associated to schema S. Now we are in position to characterize the admissible answers to Q from D, as those that are invariant under the matching resolution process.

Definition 7.
A tuple of constants ā is a resolved answer to Q(x̄) wrt the set M of MDs, denoted D ⊨_M Q[ā], iff D′ ⊨ Q[ā], for every D′ ∈ Res(D, M). We denote with ResAn(D, Q, M) the set of resolved answers to Q from D wrt M. ✷

[Figure 1: An MD-Graph]

Example 9. (Example 7 continued) The set of resolved answers to the query Q1(x, y) : R(x, y) is empty, since there are no tuples that are in the instance of R in all minimal resolved instances. On the other hand, the set of resolved answers to the query Q2(x) : ∃y(R(x, y) ∧ (y = b1 ∨ y = d1)) is {a1}. ✷

In Section 4 we will study the complexity of the problem of computing the resolved answers, which we now formally introduce.

Definition 8. Given a schema S, a query Q(x̄) ∈ L(S), and a set M of MDs, the Resolved Answer Problem (RAP) is the problem of deciding membership of the set RA_{Q,M} := {(D, ā) | ā is a resolved answer to Q from instance D wrt M}. If Q is a boolean query, it is the problem of determining whether Q is true in all minimal resolved instances of D. ✷

4. COMPUTING RESOLVED INSTANCES AND ANSWERS

In this section, we consider the complexity of the RA_{Q,M} problem introduced in the previous section. For this goal it is useful to associate a graph to the set of MDs. We need a few notions before introducing it.

Definition 9. A set M of MDs is in standard form if no two MDs in M have the same expression to the left of the arrow. ✷

Notice that any set of MDs can be put in standard form by replacing subsets of MDs of the form {R[Ā] ≈ S[B̄] → R[C̄1] ⇋ S[Ē1], ..., R[Ā] ≈ S[B̄] → R[C̄n] ⇋ S[Ēn]} by the single MD R[Ā] ≈ S[B̄] → R[C̄] ⇋ S[Ē], where the set of corresponding pairs of attributes of (C̄, Ē) is the union of those of (C̄1, Ē1), ..., (C̄n, Ēn). From now on, we will assume that all sets of MDs are in standard form.

For an MD m, LHS(m) and RHS(m) denote the sets of attributes that appear to the left side and to the right side of the arrow, respectively.

Definition 10.
Let M be a set of MDs in standard form. The MD-graph of M, denoted MDG(M), is a directed graph with a vertex labeled m for each m ∈ M, and with an edge from m1 to m2 iff RHS(m1) ∩ LHS(m2) ≠ ∅. ✷

Example 10. Consider the set of MDs:

m1 : R[A] ≈ S[B] → R[C] ⇋ S[D].
m2 : R[C] ≈ S[D] → R[A] ⇋ S[B].
m3 : S[E] ≈ S[B] → T[F] ⇋ T[F].

It has the MD-graph shown in Figure 1. ✷

A set of MDs whose MD-graph contains edges is called interacting. Otherwise, it is non-interacting. We will use the following notions to discuss the tractability of computing resolved answers.

Definition 11. Let D be a database instance and M a set of MDs. A triple (t, A, v), with (t, A) ∈ T_D, is a certain triple if f_{D′}(t, A) = v in all MRIs D′ of D. ✷

Definition 12. A set of MDs M defined on a database schema S is hard if it is an NP-hard problem to determine, given an instance D with schema S and set of certain triples P, whether P is the set of all certain triples. The set M is easy if this can be determined in polynomial time. ✷

Definition 13. For a query Q and set of MDs M defined on a schema S, RASet_{Q,M} is the problem of determining, given an instance D with schema S and a set P of answers to Q, whether P is the set of all resolved answers to Q on D. ✷

RAP is intractable for hard sets of MDs even for some very simple queries. The following result is straightforward.

Theorem 3. Let M be a hard set of MDs defined on a schema S. Then there exists a single-atom conjunctive query Q on S for which RASet_{Q,M} is NP-hard. ✷

It is straightforward to verify that all sets of non-interacting MDs are easy. We now turn to the simplest case of interacting MDs: A set M of two MDs such that MDG(M) has a single directed edge from one vertex to the other. For the complexity results of this section, we make the assumption that, for all similarity operators, there exists an infinite set of pairwise dissimilar elements.

Definition 14. An ordered pair (m1, m2) of MDs is a linear pair of MDs if the MD-graph of {m1, m2} has a single edge, and this edge is from m1 to m2. ✷

For a pair of database instances D and D′, (D, D′) ⊨ M implies that certain groups of values in D must be set to a common value in D′. Since all similarity operators subsume equality, these values are similar in D′. We call similarities of D′ which hold in D or which are implied by (D, D′) ⊨ M intended similarities. Other new similarities can also arise in the updated instance D′, which we call accidental similarities. These similarities result from the particular choice of update value, and do not occur in all D′ satisfying (D, D′) ⊨ M.

Example 11. Consider the two-attribute relation R given below, and the MD R[A] = R[A] → R[B] ⇋ R[B].

R         A   B
    t1 :  a   c
    t2 :  a   e
    t3 :  b   d
    t4 :  b   f

One possible updated instance is

R         A   B
    t1 :  a   d
    t2 :  a   d
    t3 :  b   d
    t4 :  b   d

Among the B attribute values, the intended similarities are t1[B] = t2[B] and t3[B] = t4[B], and the accidental similarities are t1[B] = t3[B], t1[B] = t4[B], t2[B] = t3[B], and t2[B] = t4[B]. ✷

In the case of interacting MDs, accidental similarities are a source of intractability for the computation of resolved answers. This is because accidental similarities produced by the application of one MD affect the application of other MDs, leading to a dependence on the choices of common values.

Theorem 4. The following set of MDs is hard:

R[A] ≈ R[A] → R[B] ⇋ R[B]
R[B] ≈ R[B] → R[C] ⇋ R[C]
✷

Linear pairs of MDs nonetheless can sometimes be easy in spite of the occurrence of accidental similarities. This happens when the interaction between the MDs is more restricted in the sense that accidental similarities generated by one MD cannot affect the application of the other MD. For example, the set of MDs

R[A] ≈ R[A] → R[B] ⇋ R[B]
R[A] ≈ R[A] ∧ R[B] ≈ R[B] → R[C] ⇋ R[C]

is easy.
Intuitively, the conjunct R[A] ≈ R[A] in the second MD "filters out" the accidental similarities among the values of attributes in the B column, allowing only the intended similarities to be passed on to the C column by the second MD. More generally, the following can be proved.

Theorem 5. Any linear pair (m1, m2) of MDs of the form

m1 : R[Ā] ≈1 S[B̄] → R[C̄] ⇋ S[Ē]
m2 : R[Ā] ≈1 S[B̄] ∧ R[F̄] ≈2 S[Ḡ] → R[H̄] ⇋ S[Ī]

is easy. ✷

5. CONCLUSIONS

In this paper we have proposed a revised semantics for matching dependency (MD) satisfaction with respect to the one originally proposed in [11]. The main outcomes from that semantics are the notions of minimally resolved instance (MRI) and resolved answers (RAs) to queries. The former capture the intended, clean instances obtained after enforcing the MDs on a given instance. The latter are query answers that persist across all the MRIs, and can be considered as robust and semantically correct answers.

We investigated the new semantics, the MRIs and the RAs. We considered the existence of MRIs and derived some preliminary results on the complexity of computing the RAs.

In our ongoing and future work, we are deriving syntactic criteria on MDs for identifying what we called easy and hard sets of MDs. We are also developing query rewriting methods for obtaining the RAs in the easy cases.

In this paper we have not considered cases where the matchings of attribute values, whenever prescribed by the MDs' conditions, are made according to matching functions [6]. This element adds an entirely new dimension to the semantics and the problems investigated here. It certainly deserves investigation.

6. REFERENCES

[1] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In Proc. PODS, pages 68–79, 1999.
[2] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. Euijong Whang, and J. Widom. Swoosh: A generic approach to entity resolution. VLDB Journal, 18(1):255–276, 2009.
[3] L. Bertossi. Consistent query answering in databases. ACM SIGMOD Record, 35(2):68–76, 2006.
[4] L. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko. The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Information Systems, 33(4):407–434, 2008.
[5] L. Bertossi and J. Chomicki. Query answering in inconsistent databases. In Logics for Emerging Applications of Databases, pages 43–83. Springer, 2003.
[6] L. Bertossi, S. Kolahi, and L. Lakshmanan. Data cleaning and query answering with matching dependencies and matching functions. In Proc. ICDT, 2011.
[7] J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys, 41(1):1–41, 2008.
[8] J. Chomicki. Consistent query answering: Five easy pieces. In Proc. ICDT, pages 1–17, 2007.
[9] A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[10] W. Fan. Dependencies revisited for improving data quality. In Proc. PODS, pages 159–170, 2008.
[11] W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. In Proc. VLDB, pages 407–418, 2009.
[12] S. Flesca, F. Furfaro, and F. Parisi. Querying and repairing inconsistent numerical databases. ACM Transactions on Database Systems, 35(2), 2010.
[13] E. Franconi, A. Laureti Palma, N. Leone, S. Perri, and F. Scarcello. Census data repair: A challenging application of disjunctive logic programming. In Proc. LPAR, Springer LNCS 2250, pages 561–578, 2001.
[14] J. Wijsen. Database repairing using updates. ACM Transactions on Database Systems, 30(3):722–768, 2005.

APPENDIX

A. AUXILIARY RESULTS AND PROOFS

Proof of Theorem 1: Consider an undirected graph G whose vertices are labelled by pairs (t, A), where t is a tuple identifier and A is an attribute of t. There is an edge between two vertices (s, A) and (t, B) iff s and t satisfy the similarity condition of some MD m ∈ M such that A and B are matched by m. Update D as follows.
Choose a vertex (t1, A) such that there is another vertex (t2, B) connected to (t1, A) by an edge and t1[A] and t2[B] must be made equal to satisfy the equalities in condition 1. of Definition 3. For convenience in this proof, we say that t2 is unequal to t1 for such a pair of tuples t1 and t2. Perform a breadth-first search (BFS) on G starting with (t1, A) as level 0. During the search, if a tuple is discovered at level i + 1 that is unequal to an adjacent tuple at level i, the value of the attribute in the former tuple is modified so that it matches that of the latter tuple. When the BFS has completed, another vertex with an adjacent unequal tuple is chosen and another BFS is performed. This continues until no such vertices remain. It is clear that the resulting updated instance D′ satisfies condition 1. of Definition 3.

We now show by induction on the levels of the breadth-first searches that for all vertices (t, A) visited, t[A] is modifiable. This is true in the base case, by choice of the starting vertex. Suppose it is true for all levels up to and including the i-th level. By definition of the graph G and condition 2. of Definition 2, the statement is true for all vertices at the (i+1)-th level. This proves the first statement of the theorem.

To prove the second statement, we show that, to satisfy condition 1. of Definition 3, the attribute values represented by each vertex in each connected component of G must be changed to a common value in the new instance. The statement then follows from the fact that the update algorithm can be modified so that the attribute value for the initial vertex in each BFS is updated to some arbitrary value at the start (since it is modifiable). By condition 1. of Definition 3, the pairs of values that must be equal in the updated instance D′ correspond to those vertices that are connected by an edge in G.
This fact and transitivity of equality imply that all attribute values in a connected component must be updated to a common value. ✷

Proof of Theorem 2: We give an algorithm to compute a resolved instance, and use a monotonicity property to show that it always terminates. For attribute domain d in D, consider the set S^d of pairs (t, A) such that attribute A of the tuple with identifier t has domain d. Let {S1, S2, ..., Sn} be a partition of S^d into sets such that all tuple/attribute pairs in a set have the same value in D. Define the level of (t, A) to mean |Sj| where (t, A) ∈ Sj.

The algorithm first applies all MDs in M to D by setting equal pairs of unequal values according to the MDs. Specifically, consider a connected component C of the graph in the proof of Theorem 1. If the values of t[A] for all pairs (t, A) in C are not all the same, then their values are modified to a common value which is that of the pair with the highest level. This update is allowed by Theorem 1. In the case of a tie, the common value is chosen as the largest of the values according to some total ordering of the values from the domain that occur in the instance. It is easily verified that this operation increases the sum over all the levels of the elements of S^d, where d is the domain of the attributes of the pairs in C.

These updates produce an instance D1 such that (D, D1) |= M. The MDs of M are then applied to the instance D1 to obtain a new instance D2 such that (D1, D2) |= M, and so on, until a stable instance is reached. For each new instance, the sum over all domains d of the levels of the (t, A) ∈ S^d is greater than for the previous instance. Since this quantity is bounded above, the algorithm terminates with a resolved instance. ✷

Proof of Theorem 4: The proof is by reduction from MONOTONE 3-SAT. Given an instance F of MONOTONE 3-SAT with clauses c1, c2, ..., cn, construct an instance of R with tuples t1, t2, ..., t3n. The sets {t1, t2, t3}, {t4, t5, t6}, ...
are called 3-blocks. For 0 ≤ i < n, t_{3i+1}[A] = t_{3i+2}[A] = t_{3i+3}[A] = k_i, where the k_i are pairwise dissimilar. We refer to a clause as a positive (negative) clause if it contains only positive (negative) literals. If c_i is a positive clause, then t_{3(i−1)+1}[C] = t_{3(i−1)+2}[C] = t_{3(i−1)+3}[C] = a, and if c_i is a negative clause, then t_{3(i−1)+1}[C] = t_{3(i−1)+2}[C] = t_{3(i−1)+3}[C] = b. The values in the B column consist of a set S of pairwise dissimilar values, one for each variable in F. The values of t_{3(i−1)+1}[B], t_{3(i−1)+2}[B], and t_{3(i−1)+3}[B] are the values in S corresponding to the variables in c_i.

In a resolved instance, the B attribute values of all tuples in a 3-block must be equal (because of the first MD). Minimal change in the B column is achieved by choosing as the common value any of the original B attribute values in the 3-block. We will show that there is a resolved instance with this choice of values for the B column and with no change to the values in the C column iff F is satisfiable. This is the only MRI when F is satisfiable. Thus, it is NP-hard to determine which values can change in an MRI, implying that the set of MDs is hard.

In a satisfying assignment to F, there is a literal for each clause in F that is made true by the assignment. For each 3-block, choose as the common B attribute value the value corresponding to the literal of its clause that is made true by a satisfying assignment. Since the assignment is consistent, the values chosen for the 3-blocks corresponding to positive clauses are dissimilar to those corresponding to negative clauses. This implies that the second (and final) update will set to a common value exactly those sets of values in the C column that had a common value in the original instance. Choosing the original values as the update values, there is no change in the C column.

Conversely, suppose there is no satisfying assignment to F.
Then, if a variable is chosen from each clause, there must be a negative and a positive clause such that the same variable was chosen from them (otherwise F could be made true by setting the variables chosen from positive clauses true and setting those chosen from negative clauses false). This implies that, when update values are chosen so as to achieve minimal change in the B column in the first update, the second update must set to a common value sets of values in the C column that were originally distinct. Specifically, pairs of 3-blocks whose C attribute values were originally distinct must have their C attribute values set to a common value. ✷

For the next proof, we need an auxiliary definition and result.

Definition 15. Let m be the MD R[Ā] ≈ S[B̄] → R[C̄] ⇋ S[Ē]. The transitive closure, T_≈, of ≈ is the transitive closure of the binary relation on tuples t1[Ā] ≈ t2[B̄], where t1 ∈ R and t2 ∈ S. ✷

Lemma 1. Let D be an instance and let m be the MD in Definition 15. An instance D′ obtained by changing modifiable attribute values of D satisfies (D, D′) |= m iff for each equivalence class of T_≈, there is a constant vector v̄ such that, for all tuples t in the equivalence class,

t′[C̄] = v̄ if t ∈ R(D)
t′[Ē] = v̄ if t ∈ S(D)

where t′ is the tuple in D′ with the same identifier as t.

Proof: Suppose (D, D′) |= m. By Definition 3, for each pair of tuples t1 ∈ R(D) and t2 ∈ S(D) such that t1[Ā] ≈ t2[B̄],

t′1[C̄] = t′2[Ē]

Therefore, if T_≈(t1, t2) is true, then t′1 and t′2 must be in the transitive closure of the binary relation expressed by t′1[C̄] = t′2[Ē]. But the transitive closure of this relation is the relation itself (because of the transitivity of equality). Therefore, t′1[C̄] = t′2[Ē]. The converse is trivial. ✷

Proof of Theorem 5: We prove the specific case in which the MDs have the form

m1 : R[Ā] ≈1 S[B̄] → R[C] ⇋ S[E]
m2 : R[Ā] ≈1 S[B̄] ∧ R[C] ≈2 S[E] → R[H] ⇋ S[I]

A resolved instance is obtained after two updates.
The pairs of tuples t1 ∈ R and t2 ∈ S such that the equality t1[C] = t2[E] is imposed by the first update are those that satisfy T_≈1(t1, t2), by Lemma 1. The second update does not affect the values of the attributes R[C] and S[E], and the pairs of tuples t1 ∈ R and t2 ∈ S such that t1[H] and t2[I] are made equal are simply those that satisfy T_≈1(t1, t2). Furthermore, the relation T_≈1(t1, t2) subsumes the transitive closure of t1[Ā] ≈1 t2[B̄] ∧ t1[C] ≈2 t2[E] before the first update, and so the net effect of the two updates on the values of R[H] and S[I] is to impose t1[H] = t2[I] on pairs of tuples satisfying T_≈1(t1, t2). Since T_≈1(t1, t2) is computable in polynomial time, the result follows. ✷
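The polynomial-time claim for the transitive closure of ≈1 can be illustrated with a union-find pass over all pairs of R and S tuples. This is a minimal Python sketch under stated assumptions, not the authors' procedure: the function closure_classes, the dictionaries mapping tuple identifiers to their Ā and B̄ values, and the numeric similarity predicate are all hypothetical, and the closure is taken to be symmetric as well as transitive so that it yields the equivalence classes used in Lemma 1.

```python
def closure_classes(r_tuples, s_tuples, similar):
    """Equivalence classes induced by closing t1[A] ≈ t2[B] transitively.

    r_tuples and s_tuples map tuple identifiers (assumed distinct across
    both relations) to the attribute values compared by the MD.
    """
    ids = list(r_tuples) + list(s_tuples)
    parent = {t: t for t in ids}

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # One similarity test per (R tuple, S tuple) pair.
    for r, a in r_tuples.items():
        for s, b in s_tuples.items():
            if similar(a, b):
                parent[find(r)] = find(s)

    classes = {}
    for t in ids:
        classes.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical data: values are similar when they differ by at most 1.
R = {"r1": 1, "r2": 3, "r3": 10}
S = {"s1": 2, "s2": 20}
classes = closure_classes(R, S, lambda a, b: abs(a - b) <= 1)
```

In this example r1 and r2 fall into the same class only because both are similar to s1. By Lemma 1, each class is a set of tuples whose matched attributes must receive one common value, and the pairwise loop performs |R|·|S| similarity tests.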
