Several recent papers argue for approximate lookups in hierarchical data and propose index structures that support approximate searches in large sets of hierarchical data. These index structures must be updated if the underlying data changes. Since the performance of a full index reconstruction is prohibitive, the index must be updated incrementally. We propose a persistent and incrementally maintainable index for approximate lookups in hierarchical data. The index is based on small tree patterns, called pq-grams. It supports efficient updates in response to structure and value changes in hierarchical data and is based on the log of tree edit operations. We prove the correctness of the incremental maintenance for sequences of edit operations. Our algorithms identify a small set of pq-grams that must be updated to maintain the index. The experimental results with synthetic and real data confirm the scalability of our approach.
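The pq-gram construction mentioned above can be sketched as follows. This is a minimal illustration of extracting a pq-gram profile (stem of p ancestors, base of q consecutive children, padded with a dummy label `*`), not the paper's incremental maintenance algorithm; the `Node` class and the tiny example tree are assumptions made for the sketch.

```python
# Sketch of pq-gram extraction (p = stem length, q = base length).
# Dummy label "*" pads missing ancestors and siblings.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def pq_grams(node, p=2, q=3, stem=None, grams=None):
    """Collect the pq-gram profile of the tree rooted at `node`."""
    if grams is None:
        grams = []
    if stem is None:
        stem = ("*",) * p
    stem = stem[1:] + (node.label,)      # shift current label into the stem
    base = ("*",) * q
    if not node.children:
        grams.append(stem + base)        # leaf: one gram with an all-dummy base
    else:
        for child in node.children:
            base = base[1:] + (child.label,)
            grams.append(stem + base)
            pq_grams(child, p, q, stem, grams)
        for _ in range(q - 1):           # trailing grams as the base shifts out
            base = base[1:] + ("*",)
            grams.append(stem + base)
    return grams

tree = Node("a", [Node("b"), Node("c")])
profile = pq_grams(tree, p=2, q=2)
```

For the two-child tree above with p = q = 2, the profile contains five grams; an index over such profiles only needs to touch the grams whose stems or bases overlap an edited node, which is the intuition behind the incremental updates in the abstract.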
The paper presents a data cleansing technique for string databases. We propose and evaluate an algorithm that identifies a group of strings that consists of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings. All strings in a group are replaced by the most frequent string of this group. Our method targets proper noun databases, including names and addresses, which are not handled by dictionaries. At the technical level we give an efficient solution for computing the center of a group of strings and determining the border of the group. We use inverse strings together with sampling to efficiently identify and cleanse a database. The experimental evaluation shows that for proper nouns the center calculation and border detection algorithms are robust and even very small sample sizes yield good results.
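The group-and-replace idea can be sketched as follows: take each distinct string, map it to the most frequent string within a small edit distance, and rewrite the database accordingly. The threshold and toy data are illustrative assumptions; the paper's inverse-string and sampling machinery is not reproduced here.

```python
# Illustrative sketch: cleanse a bag of strings by replacing each group
# (a frequent "center" plus nearby misspellings) with its center.
# The edit-distance threshold and data are made up for the example.
from collections import Counter

def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cleanse(strings, threshold=2):
    counts = Counter(strings)
    centers = [s for s, _ in counts.most_common()]  # frequent strings first
    mapping = {}
    for s in counts:
        # map each string to the most frequent string within the threshold
        for c in centers:
            if edit_distance(s, c) <= threshold:
                mapping[s] = c
                break
    return [mapping[s] for s in strings]

data = ["Smith", "Smith", "Smiht", "Smith", "Jones", "Jnoes"]
print(cleanse(data))  # prints ['Smith', 'Smith', 'Smith', 'Smith', 'Jones', 'Jones']
```

Here "Smith" and "Jones" act as group centers because they are the most frequent strings within the threshold of their misspellings.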
Many applications have to deal with a large amount of data which not only represent the perceived state of the real world at present, but also past and/or future states. This type of time-varying data arises for example in banking and insurance applications, in health care, in planning and scheduling problems, and in ticket reservation systems. These applications are not served adequately by today's database systems. In particular, deletions and updates in such systems have destructive semantics. This means that previous database contents (representing previous perceived states of the real world) cannot be accessed anymore. Moreover, the formulation of queries which access different database states is often tedious and hence error-prone without special support for the temporal dimension. Temporal database systems have been developed to provide this kind of support. They assume that all data values are associated with timestamps which represent the time during which this data...
We enhance constraint-based data quality with approximate band conditional order dependencies (abcODs). Band ODs model the semantics of attributes that are monotonically related with small variations without there being an intrinsic violation of semantics. The class of abcODs generalizes band ODs to make them more relevant to real-world applications by relaxing them to hold approximately (abODs) with some exceptions and conditionally (bcODs) on subsets of the data. We study the problem of automatic dependency discovery over a hierarchy of abcODs. First, we propose a more efficient algorithm to discover abODs than in recent prior work. The algorithm is based on a new optimization to compute a longest monotonic band (longest subsequence of tuples that satisfy a band OD) through dynamic programming, decreasing the runtime from O(n²) to O(n log n) time. We then illustrate that while the discovery of bcODs is relatively straightforward, there exist codependencies between approximation and conditioning that make the problem of abcOD discovery challenging. The naive solution is prohibitively expensive as it considers all possible segmentations of tuples, resulting in exponential time complexity. To reduce the search space, we devise a dynamic programming algorithm for abcOD discovery that determines the optimal solution in O(n³ log n) complexity. To further optimize the performance, we adapt the algorithm to cheaply identify consecutive tuples that are guaranteed to belong to the same band; this improves the performance significantly in practice without losing optimality. While unidirectional abcODs are most common in practice, for generality we extend our algorithms with both ascending and descending orders to discover bidirectional abcODs. Finally, we perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets.
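The O(n log n) optimization mentioned above builds on the classical binary-search formulation of longest monotone subsequences. The sketch below shows only that classical skeleton (longest non-decreasing subsequence via `bisect`); the paper's band variant, which additionally tolerates small deviations within a band width, is not reproduced here.

```python
# Classical O(n log n) longest non-decreasing subsequence length.
# tails[k] holds the smallest possible tail value of a subsequence
# of length k + 1; each element updates tails via binary search.
import bisect

def longest_nondecreasing(seq):
    tails = []
    for x in seq:
        i = bisect.bisect_right(tails, x)  # bisect_right allows equal values
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

print(longest_nondecreasing([3, 1, 2, 2, 5, 4]))  # prints 4 (e.g. 1, 2, 2, 5)
```

Replacing the quadratic DP over all predecessor pairs with this binary-search structure is the style of improvement that takes the abOD discovery step from O(n²) to O(n log n).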
We establish an exact correspondence between temporal logic and a subset of TSQL2, a consensus temporal extension of SQL-92. The translation from temporal logic to TSQL2 developed here enables a user to write high-level queries which can be evaluated against a space-efficient representation of the database. The reverse translation, also provided, makes it possible to characterize the expressive power of TSQL2. We demonstrate that temporal logic is equal in expressive power to a syntactically defined subset of TSQL2.
Querying Multi-granular Compact Representations. Romāns Kasperovics and Michael Böhlen, Free University of Bozen-Bolzano, Italy.
Business Intelligence solutions, encompassing technologies such as multi-dimensional data modeling and aggregate query processing, are being applied increasingly to non-traditional data. This paper extends multi-dimensional aggregation to apply to data with associated interval values that capture when the data hold. In temporal databases, intervals typically capture the states of reality that the data apply to, or capture when the data are, or were, part of the current database state. This paper proposes a new aggregation operator that addresses several challenges posed by interval data. First, the intervals to be associated with the result tuples may not be known in advance, but depend on the actual data. Such unknown intervals are accommodated by allowing result groups that are specified only partially. Second, the operator contends with the case where an interval associated with data expresses that the data holds for each point in the interval, as well as the case where the data holds only for the entire interval, but must be adjusted to apply to sub-intervals. The paper reports on an implementation of the new operator and on an empirical study that indicates that the operator scales to large data sets and is competitive with respect to other temporal aggregation algorithms.
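The core of interval-aware aggregation, where result intervals depend on the data, can be sketched by splitting the timeline at all interval endpoints and emitting one result tuple per constant segment. This is a generic illustration (a COUNT over half-open intervals), not the paper's operator; the representation and toy data are assumptions.

```python
# Sketch of interval-aware aggregation: split the timeline at all interval
# endpoints and compute one result tuple per constant segment (here a COUNT).
# Half-open intervals [start, end) and the toy data are illustrative only.

def temporal_count(tuples):
    """tuples: list of (start, end) half-open intervals; returns
    [(seg_start, seg_end, count)] for every segment with count > 0."""
    points = sorted({p for s, e in tuples for p in (s, e)})
    result = []
    for s, e in zip(points, points[1:]):
        # a tuple contributes to segment [s, e) iff its interval covers it
        count = sum(1 for ts, te in tuples if ts <= s and e <= te)
        if count:
            result.append((s, e, count))
    return result

print(temporal_count([(1, 5), (3, 8)]))  # prints [(1, 3, 1), (3, 5, 2), (5, 8, 1)]
```

Note how the result intervals (1,3), (3,5), and (5,8) appear nowhere in the input: they are determined by the data, which is exactly the "partially specified result groups" challenge in the abstract.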
On the Completeness of Temporal Database Query Languages. Michael Böhlen and Robert Marti, Institut für Informationssysteme, ETH Zürich, 8092 Zürich, Switzerland. Email: {boehlen, marti}@inf.ethz.ch.
Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems - GIS '07, 2007
Many database applications deal with spatio-temporal phenomena, and during the last decade a lot of research targeted location-based services, moving objects, traffic jam prevention, meteorology, etc. In strong contrast, there exist only very few proposals for an implementation of a spatio-temporal database system, let alone a web-based spatio-temporal information system. This paper describes the design and implementation of a web-based spatio-temporal information system. The system uses Secondo as spatio-temporal DBMS for handling moving objects and MapServer as an OGC-compliant rendering engine for static spatial data. We describe the architecture of the system and compare our system with a standalone application. The paper investigates in detail issues that arise in the context of the web. First, we describe an implementation of a lightweight client that takes advantage of the functionality offered by Secondo and MapServer. Second, we describe how moving objects can be represented in GML. We discuss possible GML representations, propose an extension of GML that uses 3D segments (2D location + time) to represent moving objects, and present experiments that compare the solutions.
Proceedings 14th International Conference on Data Engineering
The association of timestamps with various data items such as tuples or attribute values is fundamental to the management of time-varying information. Using intervals in timestamps, as do most data models, leaves a data model with a variety of choices for giving a meaning to timestamps. Specifically, some such data models claim to be point-based while other data models claim to be interval-based. The meaning chosen for timestamps is important; it has a pervasive effect on most aspects of a data model, including database design, a variety of query language properties, and query processing techniques, e.g., the availability of query optimization opportunities. This paper precisely defines the notions of point-based and interval-based temporal data models, thus providing a new, formal basis for characterizing temporal data models and obtaining new insights into the properties of their query languages. Queries in point-based models treat snapshot equivalent argument relations identically. This renders point-based models insensitive to coalescing. In contrast, queries in interval-based models give significance to the actual intervals used in the timestamps, thus generally treating non-identical, but possibly snapshot equivalent, relations differently. The paper identifies the notion of time-fragment preservation as the essential defining property of an interval-based data model.
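The notion of snapshot equivalence can be made concrete with a small sketch: two interval-timestamped relations are snapshot equivalent iff they contain the same facts at every time point, even when their intervals differ. The half-open interval encoding and toy relations below are illustrative assumptions, not the paper's formalism.

```python
# Sketch of snapshot equivalence: expand interval-timestamped tuples
# into (fact, time point) pairs and compare the resulting point sets.
# A point-based query cannot distinguish snapshot equivalent relations;
# an interval-based query can. Intervals are half-open [start, end).

def snapshots(rel):
    """Expand (fact, start, end) tuples into the set of (fact, point) pairs."""
    return {(fact, t) for fact, s, e in rel for t in range(s, e)}

r1 = [("p", 1, 3), ("p", 3, 5)]   # two adjacent intervals
r2 = [("p", 1, 5)]                # one coalesced interval

print(snapshots(r1) == snapshots(r2))  # prints True: snapshot equivalent
print(r1 == r2)                        # prints False: the intervals differ
```

r1 is what r2 looks like before coalescing; a point-based model must treat them identically, while an interval-based model may not, which is the distinction the abstract formalizes.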
Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337), 1999
In order to better support current and new applications, the major DBMS vendors are stepping beyond uninterpreted binary large objects, termed BLOBs, and are beginning to offer extensibility features that allow external developers to extend the DBMS with, e.g., their own data types and accompanying access methods. Existing solutions include DB2 extenders, Informix DataBlades, and Oracle cartridges. Extensible systems offer new and exciting opportunities for researchers and third-party developers alike. This paper reports on an implementation of an Informix DataBlade for the GR-tree, a new R-tree based index. This effort represents a stress test of the perhaps currently most extensible DBMS, in that the new DataBlade aims to achieve better performance, not just to add functionality. The paper provides guidelines for how to create an access method DataBlade, describes the sometimes surprising challenges that must be negotiated during DataBlade development, and evaluates the extensibility of the Informix Dynamic Server.
Database Systems for Advanced Applications '97 - Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, 1997
A wide range of database applications manage time-varying data, and it is well-known that querying and correctly updating time-varying data is difficult and error-prone when using standard SQL. Temporal extensions of SQL offer substantial benefits over SQL when managing time-varying data. The topic of this paper is the effective implementation of temporally extended SQLs. Traditionally, it has been assumed that a temporal DBMS must be built from scratch, utilizing new technologies for storage, indexing, query optimization, concurrency control, and recovery. In contrast, this paper explores the concepts and techniques involved in implementing a temporally enhanced SQL while maximally reusing the facilities of an existing SQL implementation. The topics covered span the choice of an adequate timestamp domain that includes the time variable "NOW," a comparison of query processing architectures, and transaction processing, the latter including how to ensure ACID properties and assign timestamps to updates.
Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo. The series publishes 50- to 125-page publications on topics pertaining to data management. The scope will largely follow the purview of premier information and computer science conferences, such as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but are not limited to: query languages, database system architectures, transaction management, data warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia data management, data mining, and related subjects.
Proceedings of the 1998 ACM symposium on Applied Computing - SAC '98, 1998
In areas such as finance, marketing, and property and resource management, many database applications manage spatio-temporal data. These applications typically run on top of a relational DBMS and manage spatio-temporal data either using the DBMS, which provides little support, or employ the services of a proprietary system that co-exists with the DBMS, but is separate from and not integrated with the DBMS. This wealth of applications may benefit substantially from built-in, integrated spatio-temporal DBMS support. Providing a foundation for such support is an important and substantial challenge. This paper initially defines technical requirements for a spatio-temporal DBMS aimed at protecting business investments in the existing legacy applications and at reusing personnel expertise. These requirements provide a foundation for making it economically feasible to migrate legacy applications to a spatio-temporal DBMS. The paper next presents the design of the core of a spatio-temporal, multi-dimensional extension to SQL-92, called STSQL, that satisfies the requirements. STSQL does so by supporting so-called upward compatible, dimensional upward compatible, reducible, and non-reducible queries. In particular, dimensional upward compatibility and reducibility were designed to address migration concerns and complement proposals based on abstract data types.
We define, compute, and evaluate nested surfaces for the purpose of visual data mining. Nested surfaces enclose the data at various density levels, and make it possible to equalize the more and less pronounced structures in the data. This facilitates the detection of multiple structures, which is important for data mining where the less obvious relationships are often the most interesting ones. The experimental results illustrate that surfaces are fairly robust with respect to the number of observations, easy to perceive, and intuitive to interpret. We give a topology-based definition of nested surfaces and establish a relationship to the density of the data. Several algorithms are given that compute surface grids and surface contours, respectively.
Large parts of today's data are stored in text documents that undergo a series of changes during their lifetime. For instance, during the development of a software product the source code changes frequently. Currently, managing such data relies on version control systems (VCSs). Extracting information from large documents and their different versions is a manual and tedious process. We present Qvestor, a system that allows users to declaratively query documents. It leverages information about the structure of a document that is available as a context-free grammar and allows users to declaratively query document versions through a grammar annotated with relational algebra expressions. We define and illustrate the annotation of grammars with relational algebra expressions and show how to translate the annotations into easy-to-use SQL views.
Temporal aggregation is an important operation in temporal databases, and different variants thereof have been proposed. In this paper, we introduce a novel temporal aggregation operator, termed parsimonious temporal aggregation (PTA), that overcomes major limitations of existing approaches. PTA takes the result of instant temporal aggregation (ITA) of size n, which might be up to twice as large as the argument relation, and merges similar tuples until a given error (ε) or size (c) bound is reached. The new operator is data-adaptive and allows the user to control the trade-off between the result size and the error introduced by merging. For the precise evaluation of PTA queries, we propose two dynamic-programming-based algorithms for size- and error-bounded queries, respectively, with a worst-case complexity that is quadratic in n. We present two optimizations that take advantage of temporal gaps and different aggregation groups and achieve a linear runtime in experiments with real-world data. For the quick computation of an approximate PTA answer, we propose an efficient greedy merging strategy with a precision that is upper bounded by O(log n). We present two algorithms that implement this strategy and begin to merge as ITA tuples are produced. They require O(n log(c + β)) time and O(c + β) space, where β is the size of a read-ahead buffer and is typically very small. An empirical evaluation on real-world and synthetic data shows that PTA considerably reduces the size of the aggregation result, yet introduces only small errors. The greedy algorithms are scalable for large data sets and introduce less error than other approximation techniques.
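The greedy merging idea can be sketched in a few lines: walk the ITA result once and fuse adjacent tuples whose aggregate values are within the error bound. This is a simplified sketch in the spirit of PTA, not the paper's buffered algorithm; the error bound, the duration-weighted average, and the toy data are illustrative assumptions.

```python
# Sketch of greedy merging: fuse adjacent ITA tuples whose values differ
# by at most eps, keeping the interval union and a duration-weighted
# average. Intervals are half-open [start, end) and assumed adjacent.

def greedy_merge(ita, eps):
    """ita: list of (start, end, value) tuples sorted by start time."""
    merged = []
    for s, e, v in ita:
        if merged:
            ps, pe, pv = merged[-1]
            if pe == s and abs(pv - v) <= eps:
                w1, w2 = pe - ps, e - s   # weight values by interval length
                merged[-1] = (ps, e, (pv * w1 + v * w2) / (w1 + w2))
                continue
        merged.append((s, e, v))
    return merged

print(greedy_merge([(0, 2, 10.0), (2, 4, 10.5), (4, 6, 20.0)], eps=1.0))
# prints [(0, 4, 10.25), (4, 6, 20.0)]
```

The first two tuples differ by only 0.5 and are merged into one tuple covering [0, 6) minus the last segment, while the jump to 20.0 exceeds eps and starts a new result tuple, shrinking the ITA result while keeping the introduced error bounded.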
The VLDB Journal, DOI 10.1007/s00778-012-0263-0. Measuring Structural Similarity of Semistructured Data Based on Information-Theoretic Approaches. Sven Helmer, Nikolaus Augsten, and Michael Böhlen.
Several recent papers argue for approximate lookups in hierarchical data and propose index struct... more Several recent papers argue for approximate lookups in hierarchical data and propose index structures that support approximate searches in large sets of hierarchical data. These index structures must be updated if the underlying data changes. Since the performance of a full index reconstruction is prohibitive, the index must be updated incrementally. We propose a persistent and incrementally maintainable index for approximate lookups in hierarchical data. The index is based on small tree patterns, called pq-grams. It supports efficient updates in response to structure and value changes in hierarchical data and is based on the log of tree edit operations. We prove the correctness of the incremental maintenance for sequences of edit operations. Our algorithms identify a small set of pq-grams that must be updated to maintain the index. The experimental results with synthetic and real data confirm the scalability of our approach.
The paper presents a data cleansing technique for string databases. We propose and evaluate an al... more The paper presents a data cleansing technique for string databases. We propose and evaluate an algorithm that identifies a group of strings that consists of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings. All strings in a group are replaced by the most frequent string of this group. Our method targets proper noun databases, including names and addresses, which are not handled by dictionaries. At the technical level we give an efficient solution for computing the center of a group of strings and determine the border of the group. We use inverse strings together with sampling to efficiently identify and cleanse a database. The experimental evaluation shows that for proper nouns the center calculation and border detection algorithms are robust and even very small sample sizes yield good results.
Many applications have to deal with a large amount of data which not only represent the perceived... more Many applications have to deal with a large amount of data which not only represent the perceived state of the real world at present, but also past and/or future states. This type of time-varying data arises for example in banking and insurance applications, in health care, in planning and scheduling problems, and in ticket reservation systems. These applications are not served adequately by today's database systems. In particular, deletions and updates in such systems have destructive semantics. This means that previous database contents (repre¬ senting previous perceived states of the real world) cannot be accessed anymore. Moreover, the formulation of queries which access different database states is often tedious and hence error prone without special support for the temporal dimension. Temporal database systems have been developed to provide this kind of support. They assume that all data values are associated with timestamps which repre¬ sent the time during which this data...
We enhance constrained-based data quality with approximate band conditional order dependencies (a... more We enhance constrained-based data quality with approximate band conditional order dependencies (abcODs). Band ODs model the semantics of attributes that are monotonically related with small variations without there being an intrinsic violation of semantics. The class of abcODs generalizes band ODs to make them more relevant to real-world applications by relaxing them to hold approximately (abODs) with some exceptions and conditionally (bcODs) on subsets of the data. We study the problem of automatic dependency discovery over a hierarchy of abcODs. First, we propose a more efficient algorithm to discover abODs than in recent prior work. The algorithm is based on a new optimization to compute a longest monotonic band (longest subsequence of tuples that satisfy a band OD) through dynamic programming by decreasing the runtime from O(n 2) to O(n log n) time. We then illustrate that while the discovery of bcODs is relatively straightforward, there exist codependencies between approximation and conditioning that make the problem of abcOD discovery challenging. The naive solution is prohibitively expensive as it considers all possible segmentations of tuples resulting in exponential time complexity. To reduce the search space, we devise a dynamic programming algorithm for abcOD discovery that determines the optimal solution in O(n 3 log n) complexity. To further optimize the performance, we adapt the algorithm to cheaply identify consecutive tuples that are guaranteed to belong to the same band-this improves the performance significantly in practice without losing optimality. While unidirectional abcODs are most common in practice, for generality we extend our algorithms with both ascending and descending orders to discover bidirectional abcODs. Finally, we perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets.
We establish an exact correspondence between temporal logic and a subset of TSQL2, a consensus te... more We establish an exact correspondence between temporal logic and a subset of TSQL2, a consensus temporal extension of SQL-92. The translation from temporal logic to TSQLZ developed here enables a user to write high-level queries which can be evaluated against a spaceefficient representation of the database. The reverse translation, also provided, makes it possible to characterize the expressive power of TSQLB. We demonstrate that temporal logic is equal in expressive power to a syntactically defined subset of TSQL2.
Page 1. Querying Multi-granular Compact Representations Rom¯ans Kasperovics and Michael Böhlen Fr... more Page 1. Querying Multi-granular Compact Representations Rom¯ans Kasperovics and Michael Böhlen Free University of Bozen - Bolzano, Dominikanerplatz - P.zza Domenicani 3, 39100 Bozen - Bolzano, Italy Abstract. A common ...
Business Intelligence solutions, encompassing technologies such as multi-dimensional data modelin... more Business Intelligence solutions, encompassing technologies such as multi-dimensional data modeling and aggregate query processing, are being applied increasingly to non-traditional data. This paper extends multi-dimensional aggregation to apply to data with associated interval values that capture when the data hold. In temporal databases, intervals typically capture the states of reality that the data apply to, or capture when the data are, or were, part of the current database state. This paper proposes a new aggregation operator that addresses several challenges posed by interval data. First, the intervals to be associated with the result tuples may not be known in advance, but depend on the actual data. Such unknown intervals are accommodated by allowing result groups that are specified only partially. Second, the operator contends with the case where an interval associated with data expresses that the data holds for each point in the interval, as well as the case where the data holds only for the entire interval, but must be adjusted to apply to sub-intervals. The paper reports on an implementation of the new operator and on an empirical study that indicates that the operator scales to large data sets and is competitive with respect to other temporal aggregation algorithms.
Page 1. On the Completeness of Temporal Database Query Languages Michael Bthlen and Robert Marti ... more Page 1. On the Completeness of Temporal Database Query Languages Michael Bthlen and Robert Marti Institut fiir Informationssysteme, ETH Ziirich 8092 Ziirich, Switzerland Email: (boehlen, marti)~inf.ethz.ch Abstract. In this ...
Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems - GIS '07, 2007
Many database applications deal with spatio-temporal phenomena, and during the last decade a lot ... more Many database applications deal with spatio-temporal phenomena, and during the last decade a lot of research targeted locationbased services, moving objects, traffic jam preventions, meteorology, etc. In strong contrast, there exist only very few proposals for an implementation of a spatio-temporal database system let alone a web-based spatio-temporal information system. This paper describes the design and implementation of a webbased spatio-temporal information system. The system uses Secondo as spatio-temporal DBMS for handling moving objects and MapServer as an OGC-compliant rendering engine for static spatial data. We describe the architecture of the system and compare our system with a standalone application. The paper investigates in detail issues that arise in the context of the web. First, we describe an implementation of a lightweight client that takes advantage of the functionality offered by Secondo and MapServer. Second, we describe how moving objects can be represented in GML. We discuss possible GML representations, propose an extension of GML that uses 3D segments (2D location + time) to represent moving objects, and present experiments that compare the solutions.
Proceedings 14th International Conference on Data Engineering
The association of timestamps with various data items such as tuples or attribute values is funda... more The association of timestamps with various data items such as tuples or attribute values is fundamental to the management of time-varying information. Using intervals in timestamps, as do most data models, leaves a data model with a variety of choices for giving a meaning to timestamps. Specifically, some such data models claim to be point-based while other data models claim to be interval-based. The meaning chosen for timestamps is important-it has a pervasive effect on most aspects of a data model, including database design, a variety of query language properties, and query processing techniques, e.g., the availability of query optimization opportunities. This paper precisely defines the notions of point-based and interval-based temporal data models, thus providing a new, formal basis for characterizing temporal data models and obtaining new insights into the properties of their query languages. Queries in point-based models treat snapshot equivalent argument relations identically. This renders point-based models insensitive to coalescing. In contrast, queries in interval-based models give significance to the actual intervals used in the timestamps, thus generally treating non-identical, but possibly snapshot equivalent, relations differently. The paper identifies the notion of timefragment preservation as the essential defining property of an interval-based data model.
Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337), 1999
In order to better support current and new applications, the major DBMS vendors are stepping beyo... more In order to better support current and new applications, the major DBMS vendors are stepping beyond uninterpreted binary large objects, termed BLOBs, and are beginning to offer extensibility features that allow external developers to extend the DBMS with, e.g., their own data types and accompanying access methods. Existing solutions include DB2 extenders, Informix DataBlades, and Oracle cartridges. Extensible systems offer new and exciting opportunities for researchers and third-party developers alike. This paper reports on an implementation of an Informix DataBlade for the GR-tree, a new R-tree based index. This effort represents a stress test of the perhaps currently most extensible DBMS, in that the new DataBlade aims to achieve better performance, not just to add functionality. The paper provides guidelines for how to create an access method DataBlade, describes the sometimes surprising challenges that must be negotiated during DataBlade development, and evaluates the extensibility of the Informix Dynamic Server.
Database Systems for Advanced Applications '97 - Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, 1997
A wide range of database applications manage timevarying data, and it is well-known that querying... more A wide range of database applications manage timevarying data, and it is well-known that querying and correctly updating time-varying data is dificult and error-prone when using standard SQL. Temporal extensions of SQL ofSeer substantial benefits over SQL when managing time-varying data. The topic of this paper is the effective implementation of temporally extended SQL's. Traditionally, it has been assumed that a temporal DBMS must be built from scratch, utilizing new technologies for storage, indexing, query optimization, concurrency control, and recovery. In contrast, this paper explores the concepts and techniques involved in implementing a temporally enhanced SQL while maximally reusing the facilities of an existing SQL implementation. The topics covered span the choice of an adequate timestamp domain that includes the time van'able "NOW," a comparison. of query processing architectures, and transaction processing, the latter including how to ensure ACID properties and assign timestamps to updates.
Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo. The series publishes 50- to 125-page publications on topics pertaining to data management. The scope will largely follow the purview of premier information and computer science conferences, such as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but are not limited to: query languages, database system architectures, transaction management, data warehousing, XML and databases, data stream systems, wide-scale data distribution, multimedia data management, data mining, and related subjects.
Proceedings of the 1998 ACM symposium on Applied Computing - SAC '98, 1998
In areas such as finance, marketing, and property and resource management, many database applications manage spatio-temporal data. These applications typically run on top of a relational DBMS and manage spatio-temporal data either using the DBMS, which provides little support, or employ the services of a proprietary system that co-exists with the DBMS, but is separate from and not integrated with the DBMS. This wealth of applications may benefit substantially from built-in, integrated spatio-temporal DBMS support. Providing a foundation for such support is an important and substantial challenge. This paper initially defines technical requirements for a spatio-temporal DBMS aimed at protecting business investments in the existing legacy applications and at reusing personnel expertise. These requirements provide a foundation for making it economically feasible to migrate legacy applications to a spatio-temporal DBMS. The paper next presents the design of the core of a spatio-temporal, multi-dimensional extension to SQL-92, called STSQL, that satisfies the requirements. STSQL does so by supporting so-called upward compatible, dimensional upward compatible, reducible, and non-reducible queries. In particular, dimensional upward compatibility and reducibility were designed to address migration concerns and complement proposals based on abstract data types.
We define, compute, and evaluate nested surfaces for the purpose of visual data mining. Nested surfaces enclose the data at various density levels, and make it possible to equalize the more and less pronounced structures in the data. This facilitates the detection of multiple structures, which is important for data mining where the less obvious relationships are often the most interesting ones. The experimental results illustrate that surfaces are fairly robust with respect to the number of observations, easy to perceive, and intuitive to interpret. We give a topology-based definition of nested surfaces and establish a relationship to the density of the data. Several algorithms are given that compute surface grids and surface contours, respectively.
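The idea of enclosing data at various density levels can be sketched in one dimension (an illustrative simplification, not the paper's algorithms): each level of a kernel density estimate yields regions that are nested inside the regions of every lower level, and separate regions reveal separate structures. The dataset, bandwidth, and thresholds below are arbitrary assumptions.

```python
import math

data = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]      # two clusters
grid = [x / 10 for x in range(-40, 41)]        # -4.0 .. 4.0
bw = 0.4

def density(x):
    """Unnormalized Gaussian kernel density estimate at x."""
    return sum(math.exp(-0.5 * ((x - d) / bw) ** 2) for d in data)

dens = [density(x) for x in grid]
peak = max(dens)

def regions(level):
    """Maximal grid intervals whose density exceeds level * peak."""
    out, start = [], None
    for x, d in zip(grid, dens):
        if d >= level * peak and start is None:
            start = x
        elif d < level * peak and start is not None:
            out.append((start, x))
            start = None
    if start is not None:
        out.append((start, grid[-1]))
    return out

for level in (0.2, 0.5, 0.8):
    print(level, regions(level))
```

At every level the two clusters show up as two disjoint regions, and each higher-level region lies inside the corresponding lower-level one, mirroring the nesting property of the surfaces.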
Large parts of today's data are stored in text documents that undergo a series of changes during their lifetime. For instance, during the development of a software product the source code changes frequently. Currently, managing such data relies on version control systems (VCSs). Extracting information from large documents and their different versions is a manual and tedious process. We present Qvestor, a system for declaratively querying documents. It leverages information about the structure of a document that is available as a context-free grammar and allows document versions to be queried declaratively through a grammar annotated with relational algebra expressions. We define and illustrate the annotation of grammars with relational algebra expressions and show how to translate the annotations into easy-to-use SQL views.
Temporal aggregation is an important operation in temporal databases, and different variants thereof have been proposed. In this paper, we introduce a novel temporal aggregation operator, termed parsimonious temporal aggregation (PTA), that overcomes major limitations of existing approaches. PTA takes the result of instant temporal aggregation (ITA) of size n, which might be up to twice as large as the argument relation, and merges similar tuples until a given error (ε) or size (c) bound is reached. The new operator is data-adaptive and allows the user to control the trade-off between the result size and the error introduced by merging. For the precise evaluation of PTA queries, we propose two dynamic programming-based algorithms for size- and error-bounded queries, respectively, with a worst-case complexity that is quadratic in n. We present two optimizations that take advantage of temporal gaps and different aggregation groups and achieve a linear runtime in experiments with real-world data. For the quick computation of an approximate PTA answer, we propose an efficient greedy merging strategy with a precision that is upper bounded by O(log n). We present two algorithms that implement this strategy and begin to merge as ITA tuples are produced. They require O(n log(c + β)) time and O(c + β) space, where β is the size of a read-ahead buffer and is typically very small. An empirical evaluation on real-world and synthetic data shows that PTA considerably reduces the size of the aggregation result, yet introduces only small errors. The greedy algorithms are scalable for large data sets and introduce less error than other approximation techniques.
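The greedy-merging idea can be sketched as follows (a hedged illustration, not the paper's exact algorithm): adjacent ITA tuples with the most similar aggregate values are merged until at most c tuples remain. The tuple format `(start, end, value)` and the value-averaging merge rule are illustrative assumptions.

```python
def pta_greedy(ita, c):
    """Greedily merge adjacent ITA tuples until at most c tuples remain."""
    rows = list(ita)
    while len(rows) > c:
        # Pick the adjacent pair whose aggregate values differ least.
        i = min(range(len(rows) - 1),
                key=lambda j: abs(rows[j][2] - rows[j + 1][2]))
        (s1, _, v1), (_, e2, v2) = rows[i], rows[i + 1]
        # Merge: span both intervals, average the values
        # (this is where the approximation error is introduced).
        rows[i:i + 2] = [(s1, e2, (v1 + v2) / 2)]
    return rows

ita = [(1, 3, 10.0), (3, 5, 11.0), (5, 8, 30.0), (8, 9, 31.0)]
print(pta_greedy(ita, 2))
# [(1, 5, 10.5), (5, 9, 30.5)]
```

Note how the size bound c = 2 collapses the four ITA tuples into two, keeping the large jump in the aggregate value (10.5 vs. 30.5) intact while smoothing over the small differences.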
The VLDB Journal, DOI 10.1007/s00778-012-0263-0, regular paper: Measuring structural similarity of semistructured data based on information-theoretic approaches. Sven Helmer, Nikolaus Augsten, Michael Böhlen.
Papers by Michael Böhlen