
Arun Swami

Stanford University, Computer Science, Alumnus

Papers by Arun Swami

A Validated Cost Model for Main Memory Databases

Optimization of large join queries: combining heuristics and combinatorial techniques

ACM SIGMOD Record, 1989

... a certain length of optimization time, the quality of the solutions obtained by the methods in that ... This comparison process is repeated for different optimization time limits ... the heuristics with the techniques described in Section 3. Another reason for studying different combinations ...

Jeffrey F. Naughton

Efficient Similarity Search In Sequence Databases

Lecture Notes in Computer Science

We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the first few frequencies are strong. Another important observation is Parseval's theorem, which specifies that the Fourier transform preserves the Euclidean distance in the time or frequency domain. Having thus mapped sequences to a lower-dimensionality space by using only the first few Fourier coefficients, we use R*-trees to index the sequences and efficiently answer similarity queries. We provide experimental results which show that our method is superior to search based on sequential scanning. Our experiments show that a few coefficients (1-3) are adequate to provide good performance. The performance gain of our method increases with the number and length of sequences.
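The distance-preservation argument in this abstract can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a naive unitary DFT (scaled by 1/sqrt(n)), omits the R*-tree index entirely, and uses two made-up sequences `x` and `y`. The function names `dft`, `euclidean`, and `feature_distance` are hypothetical. Because the truncated distance sums only some of the non-negative terms of the full distance, it can never exceed the true distance.

```python
import cmath
import math

def dft(seq):
    """Naive unitary DFT: the 1/sqrt(n) scaling makes it distance-preserving."""
    n = len(seq)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(seq))
        for f in range(n)
    ]

def euclidean(a, b):
    """Euclidean distance between two real or complex sequences."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

def feature_distance(a, b, k):
    """Distance computed from only the first k DFT coefficients."""
    return euclidean(dft(a)[:k], dft(b)[:k])

# Two made-up time sequences of equal length (illustrative data only).
x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
y = [1.5, 2.5, 2.5, 4.5, 2.0, 2.0, 0.5, 0.5]

full_time = euclidean(x, y)            # distance in the time domain
full_freq = euclidean(dft(x), dft(y))  # same distance in the frequency domain
truncated = feature_distance(x, y, 3)  # lower bound from 3 coefficients
```

Under these assumptions, filtering candidates by `truncated` may admit false positives that a later exact check removes, but it cannot discard a sequence whose true distance is within the query threshold.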

Sequential procedures for query size estimation

Set-Oriented Mining for Association Rules

Mining Associations Between Sets of Items in Massive Databases

We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.
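To make the notion of an association rule between items concrete, here is a small sketch that scores a single rule by support and confidence, the two measures commonly used to decide whether a rule is significant. It is a brute-force illustration over made-up transactions, not the paper's algorithm: it has no buffer management, estimation, or pruning, and the function name `rule_stats` is hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of transactions containing both item sets
    confidence = fraction of antecedent transactions that also contain
                 the consequent
    """
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Made-up transactions: each set holds the items bought in one visit.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

support, confidence = rule_stats(transactions, {"bread"}, {"milk"})
```

A rule would then be reported only if both values clear thresholds chosen by the analyst; those thresholds are an assumption of this sketch and are not given in the abstract.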

Generalized partial indexes

Proceedings of the Eleventh International Conference on Data Engineering, 1995

This paper demonstrates the use of generalized partial indexes for efficient query processing. We propose that partial indexes be built on those portions of the database that are statistically likely to be the most useful for query processing. We identify three classes of statistical information, and two levels at which it may be available. We describe indexing strategies that use this information to significantly improve average query performance. Results from simulation experiments demonstrate that the proposed generalized partial indexing strategies perform well compared to the traditional approach to indexing.

Clustering Data Without Distance Functions

Data mining is being applied with profit in many applications. Clustering or segmentation of data is an important data mining application. One of the problems with traditional clustering methods is that they require the analyst to define distance functions that are not always available. In this paper, we describe a new method for clustering without distance functions.

Sampling-Based Selectivity Estimation for Joins Using Augmented Frequent Value Statistics


Sharing Processing in Data Mining System

An Interval Classifier for Database Mining Applications

We are given a large population database that contains information about population instances. ...

A One-Pass Space-Efficient Algorithm for Finding Quantiles

We present an algorithm for finding the quantile values of a large unordered dataset with unknown distribution. The algorithm has the following features: i) it requires only one pass over the data; ii) it is space efficient: it uses a small bounded amount of memory independent of the number of values in the dataset; and iii) the true quantile is guaranteed to lie within the lower and upper bounds produced by the algorithm. Empirical evaluation using synthetic data with various distributions as well as real data shows that the bounds obtained are quite tight. The algorithm has several applications in database systems, for example in database governors, query optimization, load balancing in multiprocessor database systems, and data mining.
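The three properties listed in this abstract (one pass, bounded memory, guaranteed bounds) can be illustrated with a much simpler stand-in technique: a fixed-bucket histogram that also records the smallest and largest value seen in each bucket. This is not the paper's algorithm. It additionally assumes the caller knows a range `[lo, hi]` containing every value, and the name `quantile_bounds` and the bucket count are hypothetical choices.

```python
import math

def quantile_bounds(values, q, lo, hi, buckets=16):
    """Return (lower, upper) bracketing the q-quantile after one pass.

    Assumes every value lies in [lo, hi] (a caller-supplied range).
    Memory use is O(buckets), independent of how many values are read.
    """
    width = (hi - lo) / buckets
    counts = [0] * buckets
    mins = [math.inf] * buckets
    maxs = [-math.inf] * buckets
    n = 0
    for v in values:  # single pass; works on any iterable, even a generator
        i = min(int((v - lo) / width), buckets - 1)
        counts[i] += 1
        mins[i] = min(mins[i], v)
        maxs[i] = max(maxs[i], v)
        n += 1
    if n == 0:
        raise ValueError("no values")
    rank = max(1, math.ceil(q * n))  # 1-based rank of the quantile element
    seen = 0
    for i in range(buckets):
        seen += counts[i]
        if seen >= rank:
            # The element of that rank fell into bucket i, so the bucket's
            # observed minimum and maximum bracket it.
            return mins[i], maxs[i]
    raise ValueError("q must be in [0, 1]")

# Made-up data, consumed through an iterator to show the single pass.
data = [(i * 37) % 101 for i in range(1000)]
lower, upper = quantile_bounds(iter(data), 0.5, lo=0, hi=100)
```

The bounds are only as tight as one bucket, so this sketch trades accuracy for simplicity; the abstract reports that the paper's own bounds are quite tight without requiring a known range.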

Clustering association rules

Proceedings 13th International Conference on Data Engineering, 1997

We consider the problem of clustering two-dimensional association rules in large databases. We present a geometric-based algorithm, BitOp, for performing the clustering, embedded within an association rule clustering system, ARCS. Association rule clustering is useful when the user desires to segment the data. We measure the quality of the segmentation generated by ARCS using the Minimum Description Length (MDL) principle of encoding the clusters on several databases including noise and errors. Scale-up experiments show that ARCS, using the BitOp algorithm, scales linearly with the amount of data.

The parameterized Round-Robin partitioned algorithm for parallel external sort

Proceedings of 9th International Parallel Processing Symposium, 1995


On the relative cost of sampling for join selectivity estimation

Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '94, 1994

Peter J. Haas, Jeffrey F. ...

Fixed-precision estimation of join selectivity

Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '93, 1993

Peter J. Haas, Jeffrey F. Naughton, S. Ses...

Web caching and prefetching: a data mining approach

Data Mining and Knowledge Discovery: Theory, Tools, and Technology III, 2001


Online Algorithms for Handling Skew in Parallel Joins

1993 International Conference on Parallel Processing - ICPP'93 Vol3, 1993


On the estimation of join result sizes

Lecture Notes in Computer Science, 1994

Good estimates of join result sizes are critical for query optimization in relational database management systems. We address the problem of incrementally obtaining accurate and consistent estimates of join result sizes. We have invented a new rule for choosing join selectivities for estimating join result sizes. The rule is part of a new unified algorithm called Algorithm ELS (Equivalence and Largest Selectivity). Prior to computing any result sizes, equivalence classes are determined for the join columns. The algorithm also takes into account the effect of local predicates on table and column cardinalities. These computations allow the correct selectivity values for each eligible join predicate to be computed. We show that the algorithm is correct and gives better estimates than current estimation algorithms.
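The first step named in this abstract, determining equivalence classes for the join columns, can be sketched with a union-find structure. This covers only that grouping step under assumed inputs: the selectivity rule and the handling of local predicates in Algorithm ELS are not reproduced here, and the column names and the function `equivalence_classes` are hypothetical.

```python
def equivalence_classes(join_predicates):
    """Group join columns that are transitively equated by equality predicates.

    join_predicates is an iterable of (left_column, right_column) pairs,
    each standing for a predicate such as R.a = S.b.
    """
    parent = {}

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    for left, right in join_predicates:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two classes

    classes = {}
    for col in list(parent):
        classes.setdefault(find(col), set()).add(col)
    return sorted(sorted(members) for members in classes.values())

# Hypothetical predicates: R.a = S.b, S.b = T.c, and T.d = U.e.
predicates = [("R.a", "S.b"), ("S.b", "T.c"), ("T.d", "U.e")]
classes = equivalence_classes(predicates)
```

Here `R.a`, `S.b`, and `T.c` land in one class, which makes the implied predicate `R.a = T.c` visible even though the query never states it.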
