... a certain length of optimization time, the quality of the solutions obtained by the methods in that ... This comparison process is repeated for different optimization time limits ... the heuristics with the techniques described in Section 3. Another reason for studying different combinations ...
We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the first few frequencies are strong. Another important observation is Parseval's theorem, which specifies that the Fourier transform preserves the Euclidean distance in the time or frequency domain. Having thus mapped sequences to a lower-dimensionality space by using only the first few Fourier coefficients, we use R*-trees to index the sequences and efficiently answer similarity queries. We provide experimental results which show that our method is superior to search based on sequential scanning. Our experiments show that a few coefficients (1-3) are adequate to provide good performance. The performance gain of our method increases with the number and length of sequences.
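The distance-preservation argument in this abstract can be checked numerically. The sketch below is illustrative only and is not the paper's implementation: the function names (`unitary_dft`, `features`), the two sample sequences, and the choice of k = 3 are assumptions made for the example, and the DFT is scaled by 1/sqrt(n) so that Parseval's theorem holds without an extra constant. It shows that the distance computed on the first few coefficients never exceeds the true distance, which is what allows an index on those coefficients to filter candidates without missing true matches.

```python
import cmath
import math


def unitary_dft(x):
    """DFT scaled by 1/sqrt(n), so Euclidean distances are preserved (Parseval)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length real or complex vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def features(x, k):
    """Keep only the first k Fourier coefficients as the index key (k is an assumption)."""
    return unitary_dft(x)[:k]


# Two hypothetical time sequences of equal length.
s1 = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
s2 = [1.5, 2.5, 2.5, 3.5, 3.5, 1.5, 0.5, 0.5]

d_time = euclidean(s1, s2)                              # distance in the time domain
d_freq = euclidean(unitary_dft(s1), unitary_dft(s2))    # same distance, frequency domain
d_feat = euclidean(features(s1, 3), features(s2, 3))    # distance on 3 coefficients only

print(d_time, d_freq, d_feat)
```

Because dropping coefficients can only remove non-negative terms from the sum of squares, `d_feat` is a lower bound on `d_time`; a range query on the truncated features may return false positives, which are discarded by a final check against the full sequences, but it cannot produce false dismissals.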
We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.
Proceedings of the Eleventh International Conference on Data Engineering, 1995
This paper demonstrates the use of generalized partial indexes for efficient query processing. We propose that partial indexes be built on those portions of the database that are statistically likely to be the most useful for query processing. We identify three classes of statistical information, and two levels at which it may be available. We describe indexing strategies that use this information to significantly improve average query performance. Results from simulation experiments demonstrate that the proposed generalized partial indexing strategies perform well compared to the traditional approach to indexing.
Data mining is being applied with profit in many applications. Clustering or segmentation of data is an important data mining application. One of the problems with traditional clustering methods is that they require the analyst to define distance functions that are not always available. In this paper, we describe a new method for clustering without distance functions.
Sampling-Based Selectivity Estimation for Joins Using Augmented Frequent Value Statistics. Peter J. Haas, Arun N. Swami. IBM Almaden Research Center, San Jose, CA 95120-6099. Abstract ...
We are given a large population database that contains information about population instances. The population is known to comprise m groups, but the population instances are not labeled with the group identification.
We present an algorithm for finding the quantile values of a large unordered dataset with unknown distribution. The algorithm has the following features: i) it requires only one pass over the data; ii) it is space efficient: it uses a small bounded amount of memory independent of the number of values in the dataset; and iii) the true quantile is guaranteed to lie within the lower and upper bounds produced by the algorithm. Empirical evaluation using synthetic data with various distributions as well as real data shows that the bounds obtained are quite tight. The algorithm has several applications in database systems, for example in database governors, query optimization, load balancing in multiprocessor database systems, and data mining.
Proceedings 13th International Conference on Data Engineering, 1997
We consider the problem of clustering two-dimensional association rules in large databases. We present a geometric-based algorithm, BitOp, for performing the clustering, embedded within an association rule clustering system, ARCS. Association rule clustering is useful when the user desires to segment the data. We measure the quality of the segmentation generated by ARCS using the Minimum Description Length (MDL) principle of encoding the clusters on several databases including noise and errors. Scale-up experiments show that ARCS, using the BitOp algorithm, scales linearly with the amount of data.
Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '94, 1994
On the Relative Cost of Sampling for Join Selectivity Estimation. Peter J. Haas, Jeffrey F. Naughton, Arun N. Swami. IBM Almaden Research Center; University of Wisconsin - Madison; IBM Almaden Research Center. peterh@almaden. ...
Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '93, 1993
Fixed-Precision Estimation of Join Selectivity. Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Arun N. Swami. Abstract: We compare the performance of sampling-based procedures for estimation of the selectivity of an equijoin. While some ...
Good estimates of join result sizes are critical for query optimization in relational database management systems. We address the problem of incrementally obtaining accurate and consistent estimates of join result sizes. We have invented a new rule for choosing join selectivities for estimating join result sizes. The rule is part of a new unified algorithm called Algorithm ELS (Equivalence and Largest Selectivity). Prior to computing any result sizes, equivalence classes are determined for the join columns. The algorithm also takes into account the effect of local predicates on table and column cardinalities. These computations allow the correct selectivity values for each eligible join predicate to be computed. We show that the algorithm is correct and gives better estimates than current estimation algorithms.
Papers by Arun Swami