Liat Peterfreund

École Normale Supérieure, Informatique, Faculty Member

Papers by Liat Peterfreund

A Researcher's Digest of GQL

HAL (Le Centre pour la Communication Scientifique Directe), Mar 28, 2023

GQL (Graph Query Language) is being developed as a new ISO standard for graph query languages to play the same role for graph databases as SQL plays for relational. In parallel, an extension of SQL for querying property graphs, SQL/PGQ, is added to the SQL standard; it shares the graph pattern matching functionality with GQL. Both standards (not yet published) are hard-to-understand specifications of hundreds of pages. The goal of this paper is to present a digest of the language that is easy for the research community to understand, and thus to initiate research on these future standards for querying graphs. The paper concentrates on pattern matching features shared by GQL and SQL/PGQ, as well as querying facilities of GQL.

Diversity and Inclusion Activities in Database Conferences: A 2021 Report

HAL (Le Centre pour la Communication Scientifique Directe), 2022

Handling SQL Nulls with Two-Valued Logic

HAL (Le Centre pour la Communication Scientifique Directe), Jan 8, 2021

The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness, but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that, contrary to the widely held view, SQL could have been designed based on the standard Boolean logic, without any loss of expressiveness and without giving up nulls. The approach itself follows SQL's evaluation which only retains tuples for which conditions in the WHERE clause evaluate to true. We show that conflating unknown, resulting from nulls, with false leads to an equally expressive version of SQL that does not use the third truth value. Queries written under the two-valued semantics can be efficiently translated into the standard SQL and thus executed on any existing RDBMS. These results cover the core of the SQL 1999 Standard, including SELECT-FROM-WHERE-GROUP BY-HAVING queries extended with subqueries and IN/EXISTS/ANY/ALL conditions, and recursive queries. We provide two extensions of this result showing that no other way of converting 3VL into Boolean logic, nor any other many-valued logic for treating nulls could have possibly led to a more expressive language. These results not only present small modifications of SQL that eliminate the source of many programmer errors without the need to reimplement database internals, but they also strongly suggest that new query languages for various data models do not have to follow the much criticized SQL's three-valued approach.
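To make the contrast in this abstract concrete, here is a minimal Python sketch of a WHERE filter under three-valued logic and under a two-valued reading in which a comparison involving a null is simply false. The helper names (`eq3`, `not3`, `or3`, `eq2`, `where`) and the use of `None` for both null and unknown are illustrative assumptions; the sketch does not reproduce the paper's formal semantics or its translation into standard SQL.

```python
from typing import Callable, List, Optional

Truth = Optional[bool]  # None stands in for SQL's third truth value, "unknown"

def eq3(a: Optional[int], b: Optional[int]) -> Truth:
    """Three-valued equality: comparing with a null (None) yields unknown."""
    if a is None or b is None:
        return None
    return a == b

def not3(p: Truth) -> Truth:
    """Three-valued negation: unknown stays unknown."""
    return None if p is None else (not p)

def or3(p: Truth, q: Truth) -> Truth:
    """Three-valued disjunction (Kleene logic)."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def eq2(a: Optional[int], b: Optional[int]) -> bool:
    """Two-valued equality: a comparison involving a null is false."""
    return a is not None and b is not None and a == b

def where(rows: List[Optional[int]], cond: Callable[[Optional[int]], Truth]) -> List[Optional[int]]:
    """Keep only the rows whose condition evaluates to true, as WHERE does."""
    return [r for r in rows if cond(r) is True]

rows = [1, 2, None]  # a single nullable column

# Condition: x = 1 OR NOT (x = 1)
kept_3vl = where(rows, lambda x: or3(eq3(x, 1), not3(eq3(x, 1))))
kept_2vl = where(rows, lambda x: eq2(x, 1) or not eq2(x, 1))

print(kept_3vl)  # the null row is dropped: the condition is unknown for it
print(kept_2vl)  # the null row is kept: the condition is a Boolean tautology
```

Under 3VL the seemingly tautological condition filters out the null row, which is the kind of unintuitive behavior the abstract refers to; under the two-valued reading the condition behaves as ordinary Boolean logic would suggest.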

Complexity Bounds for Relational Algebra over Document Spanners

arXiv (Cornell University), Jan 14, 2019

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression. (Footnote 2: This is the spanner analog of a recent line of work on the enumeration complexity of database and string queries [2, 3, 21, 28].)

Diversity, Equity and Inclusion Activities in Database Conferences: A 2022 Report

ACM SIGMOD Record

The Diversity, Equity and Inclusion (DEI) initiative started as the Diversity/Inclusion initiative in 2020 [4]. The current report summarizes our activities in 2022. Our responsibility as a community is to ensure that attendees of DB conferences feel included, irrespective of their scientific perspective and personal background. One of the first steps was to establish the role of the DEI chairs at DB Conferences, with the DEI team dedicated to providing leadership to help our community achieve this goal. In this leadership role, the DEI team is advising DEI chairs at DB conferences, serving as a memory of DEI events at conferences, building an agreed-upon vision, and committing to working together to devise a set of measures for achieving DEI. That is pursued via actions led by our core members (Figure 1) and liaisons of individual executive bodies (Figure 2): REACH OUT collects data and experiences from our community. INCLUDE monitors and recommends inclusion efforts. ORGANIZE focu...

Grammars for Document Spanners

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial ...

Grammars for Document Spanners

A new grammar-based language for defining information-extractors from textual content based on the document spanners framework of Fagin et al. is proposed. While studied languages for document spanners are mainly built upon regex formulas, which are regular expressions extended with variables, this new language is based on context-free grammars. The expressiveness of these grammars is compared with previously studied classes of spanners and the complexity of their evaluation is discussed. An enumeration algorithm that outputs the results with constant delay after cubic preprocessing in the input document is presented.

Joining Extractions of Regular Expressions

Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bou...
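As a rough illustration of how a regex formula with one capture variable yields a relation of spans, the following Python sketch enumerates every span of a document whose content matches an inner regular expression. The function name `spans_matching` and the 0-based, half-open span convention are assumptions made for this sketch; the brute-force enumeration only stands in for the idea and is not the evaluation technique studied in the paper.

```python
import re
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end), 0-based and half-open: doc[start:end]

def spans_matching(doc: str, inner: str) -> Set[Span]:
    """Return all spans of `doc` whose content fully matches `inner`.

    This mimics a one-variable formula of the shape  .* x{inner} .* :
    the variable x may bind to any span whose substring matches `inner`,
    so overlapping spans are all reported.
    """
    pattern = re.compile(inner)
    n = len(doc)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(i, n + 1)
        if pattern.fullmatch(doc, i, j)  # match exactly doc[i:j]
    }

relation = spans_matching("aab", r"a+")
for start, end in sorted(relation):
    print((start, end), repr("aab"[start:end]))
```

Unlike `re.finditer`, which reports only non-overlapping matches, this enumeration returns every qualifying span. With several capture variables the number of tuples can grow quickly, which matches the source of hardness the abstract points to.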

Recursive Programs for Document Spanners

A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are obtained by adding capture variables to regular expressions. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (extracting relations that play the role of EDBs from the input document). In this paper, we investigate the expressive power of recursive Datalog over regex formulas. Our main result is that such programs capture precisely the document spanners computable in polynomial time. Additional results compare recursive programs to known formalisms such as the language of core spanners (that extends regular spanners by allowing to test for string equality) and its cl...

Detecting Ambiguity in Prioritized Database Repairing

In its traditional definition, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way." Often, repairs are not equally legitimate, as it is desired to prefer one over another; for example, one fact is regarded more reliable than another, or a more recent fact should be preferred to an earlier one. Motivated by these considerations, researchers have introduced and investigated the framework of preferred repairs, in the context of denial constraints and subset repairs. There, a priority relation between facts is lifted towards a priority relation between consistent databases, and repairs are restricted to the ones that are optimal in the lifted sense. Three notions of lifting (and optimal repairs) have been proposed: Pareto, global, and completion. In this paper we investigate the complexity of deciding whether the priority relation suffices to clean the database unambiguously, or in other words, whether there i...

Joining Extractions of Regular Expressions

Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2018

Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via the Relational Algebra as studied in the context of "document spanners," Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text. Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intra...

Finite models and the theory of concatenation

We propose FC, a logic on words that combines the previous approaches of finite-model theory and the theory of concatenation, and that has immediate applications in information extraction and database theory in the form of document spanners. Like the theory of concatenation, FC is built around word equations; in contrast to it, its semantics are defined to only allow finite models, by limiting the universe to a word and all its subwords. As a consequence of this, FC has many of the desirable properties of FO[<], while being far more expressive. Most noteworthy among these desirable properties are sufficient criteria for efficient model checking and capturing various complexity classes by extending the logic with appropriate closure or iteration operators. These results allow us to obtain new insights into and techniques for the expressive power and efficient evaluation of document spanners. In fact, FC provides us with a general framework for reasoning about words that has poten...

Closure Under Reversal of Languages over Infinite Alphabets

It is shown that languages definable by weak pebble automata are not closed under reversal. For the proof, we establish a kind of periodicity of an automaton’s computation over a specific set of words. The periodicity is partly due to the finiteness of the automaton description and partly due to the word’s structure. Using such a periodicity we can find a word such that during the automaton’s run on it there are two different, yet indistinguishable, configurations. This enables us to remove a part of that word without affecting acceptance. Choosing an appropriate language leads us to the desired result.

Recognizing Determinism in Prioritized Repairing of Inconsistent Databases

A repair of an inconsistent database is traditionally defined as a consistent database that differs from the inconsistent one in a “minimal way.” As there are often reasons to prefer one repair over another, researchers have introduced and investigated the framework of preferred repairs, where a priority relation between facts is lifted towards a priority relation between consistent databases, and repairs are restricted to ones that are optimal in the lifted sense. In this paper we describe our recent results on the complexity of deciding whether the priority relation suffices to clean the database unambiguously, or in other words, whether there is exactly one optimal repair. In particular, we show that different conventional semantics of priority lifting entail highly different complexities.

Complexity Bounds for Relational Algebra over Document Spanners

Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2019

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between t...

Weight Annotation in Information Extraction

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of vset-automata - a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the (positive) RA operators via the semiring oper...
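The propagation of semiring annotations through positive RA operators can be sketched in a few lines of Python. In the provenance-semiring abstraction, union and projection combine annotations with the semiring's addition, and join combines them with its multiplication. Everything named below (`Semiring`, `COUNTING`, `VITERBI`, `union`, `join_on_first`, `project_first`, and the toy relations) is a hypothetical illustration and not the paper's formalism for weighted spanners.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass(frozen=True)
class Semiring:
    zero: Any                       # identity for add
    one: Any                        # identity for mul
    add: Callable[[Any, Any], Any]  # used by union and projection
    mul: Callable[[Any, Any], Any]  # used by join

# Two standard commutative semirings, chosen here only as examples.
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
VITERBI = Semiring(0.0, 1.0, max, lambda a, b: a * b)  # best-confidence

Relation = Dict[Tuple, Any]  # tuple -> annotation

def union(K: Semiring, r: Relation, s: Relation) -> Relation:
    """Union: annotations of equal tuples are added."""
    out = dict(r)
    for t, w in s.items():
        out[t] = K.add(out.get(t, K.zero), w)
    return out

def join_on_first(K: Semiring, r: Relation, s: Relation) -> Relation:
    """Join binary relations on their first column: annotations are multiplied."""
    return {
        (x, y, z): K.mul(wr, ws)
        for (x, y), wr in r.items()
        for (x2, z), ws in s.items()
        if x == x2
    }

def project_first(K: Semiring, r: Relation) -> Relation:
    """Project onto the first column: annotations of merged tuples are added."""
    out: Relation = {}
    for t, w in r.items():
        key = (t[0],)
        out[key] = K.add(out.get(key, K.zero), w)
    return out

# Toy annotated extractions; spans are written as (start, end) pairs.
r = {((0, 3), "a"): 0.5, ((0, 3), "b"): 0.25}
s = {((0, 3), "c"): 0.5}

joined = join_on_first(VITERBI, r, s)
best = project_first(VITERBI, joined)
print(best)  # best confidence per span after the join
```

Swapping `VITERBI` for `COUNTING` (with integer annotations) turns the same query into a count of derivations, which is the appeal of parameterizing the operators by the semiring.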

Handling SQL Nulls with Two-Valued Logic

ArXiv, 2020

The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness, but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that, contrary to the widely held view, SQL could have been designed based on the standard Boolean logic, without any loss of expressiveness and without giving up nulls. The approach itself follows SQL’s evaluation which only retains tuples for which conditions in the WHERE clause evaluate to true. We show that conflating unknown, resulting from nulls, with false leads to an equally expressive version of SQL that does not use the third truth value. Queries written under the two-valued semantics can be efficiently translated into the standard SQL and thus executed on any existing RDBMS. These ...

Recursive Programs for Document Spanners

A document spanner models a program for Information Extraction (IE) as a function that takes as i... more A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are obtained by adding capture variables to regular expressions. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (extracting relations that play the role of EDBs from the input document). In this paper, we investigate the expressive power of recursive Datalog over regex formulas. Our main result is that such programs capture precisely the document spanners computable in polynomial time. Additional results compare recursive programs to known formalisms such as the language of core spanners (that extends regular spanners by allowing to test for string equality) and its cl...

A Researcher's Digest of GQL

HAL (Le Centre pour la Communication Scientifique Directe), Mar 28, 2023

GQL (Graph Query Language) is being developed as a new ISO standard for graph query languages to ... more GQL (Graph Query Language) is being developed as a new ISO standard for graph query languages to play the same role for graph databases as SQL plays for relational. In parallel, an extension of SQL for querying property graphs, SQL/PGQ, is added to the SQL standard; it shares the graph pattern matching functionality with GQL. Both standards (not yet published) are hard-to-understand specifications of hundreds of pages. The goal of this paper is to present a digest of the language that is easy for the research community to understand, and thus to initiate research on these future standards for querying graphs. The paper concentrates on pattern matching features shared by GQL and SQL/PGQ, as well as querying facilities of GQL.

Diversity and Inclusion Activities in Database Conferences: A 2021 Report

HAL (Le Centre pour la Communication Scientifique Directe), 2022

Handling SQL Nulls with Two-Valued Logic

HAL (Le Centre pour la Communication Scientifique Directe), Jan 8, 2021

The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic ... more The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness, but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that, contrary to the widely held view, SQL could have been designed based on the standard Boolean logic, without any loss of expressiveness and without giving up nulls. The approach itself follows SQL's evaluation which only retains tuples for which conditions in the WHERE clause evaluate to true. We show that conflating unknown, resulting from nulls, with false leads to an equally expressive version of SQL that does not use the third truth value. Queries written under the two-valued semantics can be efficiently translated into the standard SQL and thus executed on any existing RDBMS. These results cover the core of the SQL 1999 Standard, including SELECT-FROM-WHERE-GROUP BY-HAVING queries extended with subqueries and IN/EXISTS/ANY/ALL conditions, and recursive queries. We provide two extensions of this result showing that no other way of converting 3VL into Boolean logic, nor any other many-valued logic for treating nulls could have possibly led to a more expressive language. These results not only present small modifications of SQL that eliminate the source of many programmer errors without the need to reimplement database internals, but they also strongly suggest that new query languages for various data models do not have to follow the much criticized SQL's three-valued approach.

Complexity Bounds for Relational Algebra over Document Spanners

arXiv (Cornell University), Jan 14, 2019

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations... more We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression. 2 This is the spanner analog of a recent line of work on the enumeration complexity of database and string queries [2, 3, 21, 28].

Diversity, Equity and Inclusion Activities in Database Conferences: A 2022 Report

ACM SIGMOD Record

The Diversity, Equity and Inclusion (DEI) initiative started as the Diversity/Inclusion initiativ... more The Diversity, Equity and Inclusion (DEI) initiative started as the Diversity/Inclusion initiative in 2020 [4]. The current report summarizes our activities in 2022. Our responsibility as a community is to ensure that attendees of DB conferences feel included, irrespective of their scientific perspective and personal background. One of the first steps was to establish the role of the DEI chairs at DB Conferences, with the DEI team dedicated to providing leadership to help our community achieve this goal. In this leadership role, the DEI team is advising DEI chairs at DB conferences, serving as a memory of DEI events at conferences, building an agreed-upon vision, and committing to working together to devise a set of measures for achieving DEI. That is pursued via actions led by our core members (Figure 1) and liaisons of individual executive bodies (Figure 2): REACH OUT collects data and experiences from our community. INCLUDE monitors and recommends inclusion efforts. ORGANIZE focu...

Grammars for Document Spanners

We propose a new grammar-based language for defining information-extractors from documents (text)... more We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial ...

Grammars for Document Spanenrs

A new grammar-based language for defining information-extractors from textual content based on th... more A new grammar-based language for defining information-extractors from textual content based on the document spanners framework of Fagin et al.~is proposed. While studied languages for document spanners are mainly built upon regex formulas, which are regular expressions extended with variables, this new language is based on context-free grammars. The expressiveness of these grammars is compared with previously studied classes of spanners and the complexity of their evaluation is discussed. An enumeration algorithm that outputs the results with constant delay after cubic preprocessing in the input document is presented.

Joining Extractions of Regular Expressions

Regular expressions with capture variables, also known as "regex formulas," extract rel... more Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bou...

Recursive Programs for Document Spanners

A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are obtained by adding capture variables to regular expressions. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (extracting relations that play the role of EDBs from the input document). In this paper, we investigate the expressive power of recursive Datalog over regex formulas. Our main result is that such programs capture precisely the document spanners computable in polynomial time. Additional results compare recursive programs to known formalisms such as the language of core spanners (that extends regular spanners by allowing tests for string equality) and its cl...

Detecting Ambiguity in Prioritized Database Repairing

In its traditional definition, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way." Often, repairs are not equally legitimate, as it is desired to prefer one over another; for example, one fact is regarded as more reliable than another, or a more recent fact should be preferred to an earlier one. Motivated by these considerations, researchers have introduced and investigated the framework of preferred repairs, in the context of denial constraints and subset repairs. There, a priority relation between facts is lifted towards a priority relation between consistent databases, and repairs are restricted to the ones that are optimal in the lifted sense. Three notions of lifting (and optimal repairs) have been proposed: Pareto, global, and completion. In this paper we investigate the complexity of deciding whether the priority relation suffices to clean the database unambiguously, or in other words, whether there i...

Joining Extractions of Regular Expressions

Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2018

Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via the Relational Algebra as studied in the context of "document spanners," Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text. Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intra...

Finite models and the theory of concatenation

We propose FC, a logic on words that combines the previous approaches of finite-model theory and the theory of concatenation, and that has immediate applications in information extraction and database theory in the form of document spanners. Like the theory of concatenation, FC is built around word equations; in contrast to it, its semantics are defined to only allow finite models, by limiting the universe to a word and all its subwords. As a consequence of this, FC has many of the desirable properties of FO[<], while being far more expressive. Most noteworthy among these desirable properties are sufficient criteria for efficient model checking and capturing various complexity classes by extending the logic with appropriate closure or iteration operators. These results allow us to obtain new insights into and techniques for the expressive power and efficient evaluation of document spanners. In fact, FC provides us with a general framework for reasoning about words that has poten...

Grammars for Document Spanners

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of Document Spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial ...

Closure Under Reversal of Languages over Infinite Alphabets

It is shown that languages definable by weak pebble automata are not closed under reversal. For the proof, we establish a kind of periodicity of an automaton’s computation over a specific set of words. The periodicity is partly due to the finiteness of the automaton description and partly due to the word’s structure. Using such a periodicity we can find a word such that during the automaton’s run on it there are two different, yet indistinguishable, configurations. This enables us to remove a part of that word without affecting acceptance. Choosing an appropriate language leads us to the desired result.

Recognizing Determinism in Prioritized Repairing of Inconsistent Databases

A repair of an inconsistent database is traditionally defined as a consistent database that differs from the inconsistent one in a “minimal way.” As there are often reasons to prefer one repair over another, researchers have introduced and investigated the framework of preferred repairs, where a priority relation between facts is lifted towards a priority relation between consistent databases, and repairs are restricted to ones that are optimal in the lifted sense. In this paper we describe our recent results on the complexity of deciding whether the priority relation suffices to clean the database unambiguously, or in other words, whether there is exactly one optimal repair. In particular, we show that different conventional semantics of priority lifting entail highly different complexities.

Complexity Bounds for Relational Algebra over Document Spanners

Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2019

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between t...

Weight Annotation in Information Extraction

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of vset-automata, a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the (positive) RA operators via the semiring oper...

Handling SQL Nulls with Two-Valued Logic

ArXiv, 2020

The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness, but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that, contrary to the widely held view, SQL could have been designed based on the standard Boolean logic, without any loss of expressiveness and without giving up nulls. The approach itself follows SQL’s evaluation which only retains tuples for which conditions in the WHERE clause evaluate to true. We show that conflating unknown, resulting from nulls, with false leads to an equally expressive version of SQL that does not use the third truth value. Queries written under the two-valued semantics can be efficiently translated into the standard SQL and thus executed on any existing RDBMS. These ...
