Joins on Samples: A Theoretical Guide for Practitioners
Dawei Huang
Dong Young Yoon
Seth Pettie
Barzan Mozafari
University of Michigan, Ann Arbor
{hdawei, dyoon, pettie, mozafari}@umich.edu
ABSTRACT
Despite decades of research on AQP (approximate query processing), our understanding of sample-based joins has remained limited
and, to some extent, even superficial. The common belief in the
community is that joining random samples is futile. This belief is
largely based on an early result showing that the join of two uniform
samples is not an independent sample of the original join, and that it
leads to quadratically fewer output tuples. Unfortunately, this early
result has little applicability to the key questions practitioners face.
For example, the success metric is often the final approximation’s
accuracy, rather than output cardinality. Moreover, there are many
non-uniform sampling strategies that one can employ. Is sampling
for joins still futile in all of these settings? If not, what is the best
sampling strategy in each case? To the best of our knowledge, there
is no formal study answering these questions.
This paper aims to improve our understanding of sample-based
joins and offer a guideline for practitioners building and using real-world AQP systems. We study the limitations of offline samples in
approximating join queries: given an offline sampling budget, how
well can one approximate the join of two tables? We answer this
question for two success metrics: output size and estimator variance. We show that maximizing output size is easy, while there
is an information-theoretical lower bound on the lowest variance
achievable by any sampling strategy. We then define a hybrid sampling scheme that captures all combinations of stratified, universe,
and Bernoulli sampling, and show that this scheme with our optimal parameters achieves the theoretical lower bound within a constant factor. Since computing these optimal parameters requires
shuffling statistics across the network, we also propose a decentralized variant in which each node acts autonomously using minimal
statistics. We also empirically validate our findings on popular SQL
and AQP engines.
PVLDB Reference Format:
Dawei Huang, Dong Young Yoon, Seth Pettie, and Barzan Mozafari. Joins
on Samples: A Theoretical Guide for Practitioners. PVLDB, 13(4): 547-560, 2019.
DOI: https://doi.org/10.14778/3372716.3372726
1. INTRODUCTION
Approximate query processing (AQP) has regained significant
attention in recent years due to major trends in the industry [44].
Larger datasets, the memory wall, and the separation of compute and storage have all made it harder to achieve interactive-speed analytics. AQP presents itself as a viable alternative in scenarios where perfect decisions can be made with imperfect answers [8]. AQP is most appealing when a negligible loss of accuracy can be traded for a significant gain in speed or computational resources. Ad-hoc analytics [9, 58, 64], visualization [23, 50, 57], IoT [47], A/B testing [7], email marketing and customer segmentation [28], and real-time threat detection [2] are examples of such use cases.
Sampling and Joins— Sampling is one of the most widely-used
techniques for general-purpose AQP [22]. The high level idea is
to execute the query on a small sample of the original table(s) to
provide a fast, but approximate, answer. While effective for simple aggregates, using samples for join queries has long remained
an open problem [6]. There are two main approaches to AQP: offline and online. Offline approaches [5, 6, 8, 15, 27, 51] build samples (or other synopses) prior to query arrival. At run time, they simply choose appropriate samples that can yield the best accuracy/performance for each incoming query. Online approaches, on the other hand, perform much of their sampling at run time based on the query at hand [13, 20, 33, 38, 49, 61]. Naturally,
offline sampling leads to significantly higher speedup, while online
techniques can support a much wider class of queries [38]. The
same taxonomy applies to join approximation: offline techniques
perform joins on previously-prepared samples [6, 17, 18, 51, 65],
while online approaches seek to produce a sample of the output of
the join at run time [25, 29, 41, 43]. As mentioned, the latter often means more modest speedups (e.g., 2× [38]) which may not
be sufficient to justify approximation, or additional requirements
(e.g., an index for each join column [41]), which may not be acceptable to many applications. Thus, our focus in this paper—and what is considered an open problem—is the offline approach: joins on samples, not sampling the join's output.
Joins on Samples— The simplest strategy is as follows. Given two
large tables T1 and T2 , create a uniform random sample of each, say
S1 and S2 respectively, and then use S1 ⊲⊳ S2 to approximate aggregate statistics of T1 ⊲⊳ T2 . This will lead to significant speedup
if samples are much smaller than original tables, i.e., |Ti | ≫ |Si |.
One of the earliest results in this area shows that this simple strategy is futile for two reasons [5]. First, joining two uniform samples
leads to quadratically fewer output tuples, i.e., joining two uniform samples that are each a p fraction (0 ≤ p < 1) of the original tables will only produce a p^2 fraction of the output tuples of the original join (see
Figure 1). Second, joining uniform samples of two tables does not
Figure 1: A toy example of joining two uniform samples (left) versus a uniform sample of the join (right).
yield an independent sample of their join1 (see Section 2.1 for details). The dependence of the output tuples can drastically lower
the approximation accuracy [5, 17].
Prior Work— Universe sampling [31, 38, 51] addresses the first
drawback of uniform sampling. Although universe sampling avoids
the quadratic reduction of output, it creates even more correlation in its output, leading to much lower accuracy (see Section 3.1).
Atserias et al. provide a worst case lower bound for any query
involving equi-joins on multiple relations, showing that computing
exact joins with a small memory or time budget is hard [12]. For instance, the maximum possible join size for any cyclic join on three
n-tuple relations is Θ(n^1.5). Thus, a natural question is whether
approximating joins is also hard with small memory or time.
Our Goal— This paper focuses on understanding the limitation
of using offline samples in approximating join queries. Given a
sampling budget, how well can we approximate the join of two tables using their offline samples? To answer this question, we must
first define what constitutes a “good” approximation of a join. We
consider two metrics: (1) output cardinality and (2) aggregation accuracy. The former is the number of tuples of the original join that
also appear in the join of the samples, whereas the latter is the error
of the aggregates estimated from the sample-based join with respect to their true values, if computed on the original join. Because
in this paper we only consider unbiased estimators, we measure
approximation error in terms of the variance of our estimators.
For the first metric, we provide a simple proof showing that universe sampling is optimal, i.e. no sampling scheme with the same
sampling rate can outperform universe sampling in terms of the
(expected) output cardinality. However, as we show in Section 3.1,
retaining a large number of join tuples does not imply accurate aggregates. It is therefore natural to also ask about the lowest variance that can be achieved given a sampling rate. To the best of
our knowledge, this has remained an open problem to date. For the
first time, we formally study this problem and offer an informationtheoretical lower bound to this question. We also present a hybrid
sampling scheme that matches this lower bound within a constant
factor. This scheme involves a centralized computation, which can
become prohibitive for large tables due to the large amounts of statistics that need to be shuffled across the network. Thus, we also propose a decentralized variant that only shuffles a minimal amount of information across the nodes—such as the table size and maximum frequency—but still achieves the same worst case guarantees. Finally, we generalize our sampling scheme to accommodate a priori information about filters (i.e., WHERE clause).
1 Prior work has stated that joining uniform samples is not a uniform sample of the join [6]. We avoid this terminology since uniform means equal probability of inclusion, and in this case each tuple does appear in the join of the uniform samples with equal probability, but not independently. In other words, joining two i.i.d. samples is an identical, but not independent, sample of the join.
In this paper, we make the following contributions:
1. We discuss two metrics—output size and estimator’s variance
—for measuring the quality of join approximation, and show
that universe sampling is optimal for output size and there is an
information-theoretical lower bound for variance (Section 3).
2. We formalize a hybrid scheme, called Stratified-Universe-Bernoulli
Sampling (SUBS), which allows for different combinations of
stratified, universe, and Bernoulli sampling. We derive optimal sampling parameters within this scheme, and show that they
achieve the theoretical lower bound of variance within a constant factor (Section 4–5.3). We also extend our analysis to accommodate additional information regarding the WHERE clause
(Section 6).
3. Through extensive experiments, we also empirically show that
our optimal sampling parameters achieve lower error than existing sampling schemes in both centralized and decentralized
scenarios (Section 7).
2. BACKGROUND
In this section, we provide the necessary background on sampling-based join approximation. We also formally state our problem setting and assumptions.
2.1 Sampling in Databases
The following are the three main popular sampling strategies
(operators) used in AQP engines and database systems.
1. Uniform/Bernoulli Sampling. Any strategy that samples all tuples with the same probability is considered a uniform (random)
sample. Since enforcing fixed-size sampling without replacement is expensive in distributed systems, Bernoulli sampling is
considered a more efficient strategy [38]. In Bernoulli sampling,
each tuple is included in the sample independently, with a fixed
sampling probability p. In this paper, for simplicity, we use “uniform” and “Bernoulli” interchangeably. As mentioned in Section 1, joining two uniform samples leads to quadratically fewer
output tuples. Further, it does not guarantee an i.i.d. sample of
the original join [6]: the output is a uniform sample of the join
but not an independent one. Consider an arbitrary tuple of the
join, say (t1 , t2 ), where t1 is from the first table and t2 is from
the second. The probability of this tuple appearing in the join
of the samples is always the same value, i.e., p^2. The output is thus a uniform sample. However, the tuples are not independent: consider another tuple of the join, say (t1, t′2), where t′2 is another tuple from the second table joining with t1. If (t1, t2) appears in the output, the probability of (t1, t′2) also appearing becomes p instead of p^2, which would be the probability if they were independent.
2. Universe Sampling. Given a column (or set of columns) J, a (perfect) hash function h : J → [0, 1], and a sampling rate p, this strategy includes
a tuple t in the table if h(t.J) ≤ p. Universe sampling is often
used for equi-joins, in which the same p value and hash function
h are applied to the join columns in both tables. This ensures
that when a tuple t1 is sampled from one table, any matching tuple t2 from the other table is also sampled, simply because
t1 .J = t2 .J ⇔ h(t1 .J) = h(t2 .J). This is why joining two
universe samples of rate p produces p fraction of the original
join output in expectation. The output is a uniform sample of the
original join, as each join tuple appears with the same probability
p. However, there is more dependence among the output tuples.
Consider two join tuples (t1 , t2 ) and (t′1 , t′2 ) where t1 , t′1 , t2 , t′2
all share the same join key. Then, if (t1 , t2 ) appears, the probability of (t′1 , t′2 ) also appearing will be 1. Likewise, if (t1 , t2 )
does not appear, the probability of (t′1 , t′2 ) appearing will be 0.
Higher dependence means lower accuracy (see Section 3.1).
3. Stratified Sampling. The goal of stratified sampling is to ensure
that minority groups are sufficiently represented in the sample.
Groups are defined according to one or multiple columns, called
the stratified columns. A group (a.k.a. a stratum) is a set of tuples
that share the same value under those stratified columns. Given
a set of stratified columns C and an integer parameter ktuple ,
a stratified sampling is a scheme that guarantees at least ktuple
tuples are sampled uniformly at random from each group. When
a group has fewer than ktuple tuples, all of them are retained.
2.2 Quality Metrics
Different metrics can be used to assess the quality of a join approximation. In this paper, we focus on the following two, which are used by most AQP systems.
Output Size/Cardinality— This metric is the number of tuples of the original join that also appear in the join of the samples. It is mostly relevant for exploratory use cases, where users visualize or examine a subset of the output. In other cases, where an aggregate is computed from the join output, retaining a large number of output tuples does not guarantee accurate answers (we show this in Section 3.1).
Variance— In scenarios where an aggregate function needs to be calculated from the join output, the error of the aggregate approximation is more relevant than the number of intermediate tuples generated. For most non-extreme statistics, there are readily available unbiased estimators, e.g., the Horvitz-Thompson estimator [34]. Thus, a popular indicator of accuracy is the variance of the estimator [8], which determines the size of the confidence interval given a sample size.

Table 1: Notations.
Notation   Definition
T1, T2     Two tables for the join
Si         A sample generated from table Ti
J          Column(s) used for the join between T1 and T2
W          Column being aggregated (e.g., SUM, AVG)
C          Column(s) used for filters (i.e., WHERE clause)
U          Set of all possible values of J
a, b       Frequency vectors of T1's and T2's join columns, resp.
av, bv     Number of tuples with join value v in T1 and T2, resp.
Jˆagg      Estimator for a join query with aggregate function agg
ǫ          Sampling budget w.r.t. the original table size
n1, n2     Number of tuples in T1 and T2, resp.
h          A (perfect) hash function
ktuple     Minimum number of tuples to be kept per group in stratified sampling
kkey       Minimum number of join keys per group to apply universe sampling (universe sampling is not applied to groups with fewer than kkey join keys)
p          Sampling rate of universe sampling
q          Sampling rate of uniform sampling

2.3 Problem Statement
In this section, we formally state the problem of sample-based join approximation. The notations used throughout the paper are listed in Table 1.
To formally study this problem, we first need to define a class of reasonable sampling strategies. In Section 4, we define a hybrid scheme that can capture different combinations of stratified, universe, and uniform sampling.
2.4 Scope and Limitations
To simplify our analysis, we limit our scope in this paper.
Flat Equi-joins— We focus on equi (inner) joins as the most common form of joins in practice. We also support both WHERE and
GROUPBY clauses. Because our focus is on the join itself, we ignore nested queries and only consider flat (or flattened) queries.
We primarily focus on two-way joins. However, our results extend
to multi-way joins with the same join column(s).
Aggregate Functions— Most AQP systems do not support extreme statistics, such as Min or Max [46]. Likewise, we only consider non-extreme aggregates, and primarily focus on the three basic functions, COUNT, SUM, and AVG. However, we expect our
techniques to easily extend to other mean-like statistics as well,
such as VAR, STDEV, and PERCENTILE.
3. HARDNESS
In this section, we explain why providing a large output size is insufficient for approximating joins, and formally show the hardness of approximating common aggregates based on the theory of communication complexity.
Query Estimator— Let S1 and S2 be two samples generated offline from tables T1 and T2, respectively, and let qagg be a query that computes an aggregate function agg on the join of T1 and T2. A query estimator Jˆagg(S1, S2) is a function that estimates the value of agg using the two samples rather than the original tables.
Join Sampling Problem— Given a query estimator Jˆagg and a sampling budget ǫ ∈ (0, 1], our goal is to create a pair of samples S1 and S2—from tables T1 and T2, respectively—that are optimal in terms of a given success metric, while respecting the given storage budget ǫ on average. Specifically, we seek S1 and S2 that minimize Jˆagg's variance or maximize its output size such that E[|S1| + |S2|] ≤ ǫ × (|T1| + |T2|).
Note that we define the sampling budget in terms of an expected size (rather than a strict one), since sampling schemes are probabilistic in nature and may slightly over- or under-use the budget.
3.1 Output Size
Uniform sampling leads to small output size. If we sample at a
rate q from both table T1 and table T2, the join of the samples contains only a q^2 fraction of T1 ⊲⊳ T2 in expectation. Moreover, the join of
two independent samples of the original tables is in general not an
independent sample of T1 ⊲⊳ T2 , which hurts the sample quality.
In contrast, universe sampling [31, 38] with sample rate p can, in
expectation, sample a p fraction of T1 ⊲⊳ T2 . We prove that this is
optimal (all omitted proofs are deferred to our report [35]).
Theorem 1. No sampling scheme with sample rate α can guarantee more than α fraction of T1 ⊲⊳ T2 in expectation for all possible
inputs.
However, a large number of tuples retained in the join does not
imply that the original join query can be accurately approximated.
As pointed out in [18], universe sampling shows poor performance
in approximating queries when the frequencies of keys are concentrated on a few elements. Consider the following extreme example
with tables T1 and T2 , each comprised of n tuples with a single
value 1 in their join key. In this example, universe sampling with
the sampling rate p produces an estimator of variance n^4/p, while uniform sampling with rate q has a variance of n^2/q^2, which is
much lower when p = q and n is large. Thus, a larger output size
does not necessarily lead to a better approximation of the query.
3.2 Approximating Aggregate Queries
In this section, we focus on the core question: why is approximating common aggregates (e.g., COUNT, SUM and AVG) hard
when using a small sample (or more generally, a small summary)?
We address this question using the theory of communication complexity. Specifically, to show that computing COUNT on a join is
hard, we reduce it to set intersection, a canonically hard problem
in communication complexity. Assume that both Alice and Bob
each hold a set of size k, say A and B, respectively. They aim to
estimate the size of t = |A ∩ B|. Pagh et al. [48] show that if Alice
only sends a small summary to Bob, any unbiased estimator that
Bob uses will have a large variance.
Theorem 2 (See [48]). Any one-way communication protocol that
estimates t within relative error δ with probability at least 2/3 must
send at least Ω(k/(tδ^2)) bits.
Corollary 3. Any estimator to |A ∩ B| produced by Bob that is
based on an s-bits summary by Alice must have a variance of at
least Ω(kt/s).
Any sample of size s can be encoded using O(log (k choose s)) bits, implying that any estimator of COUNT that is based on a sample of size
s from one of the tables must have a variance of at least Ω(kt/s).
Estimating SUM queries is at least as hard as estimating COUNT
queries, since any COUNT can be reduced to a SUM by setting all
entries in the SUM column to 1.
From the hard instance of set intersection, we can also derive a
hard instance for AVG queries. Based on Theorem 2, any summary
of T1 that can distinguish between intersection size t(1 + δ) and
t(1 − δ) must be of size at least Ω(k/(tδ^2)) bits. Now we reduce
this problem to estimating an AVG query.
Here, the two tables consist of k + √t tuples each. The first k tuples of T1 and T2 are from the hard instance of set intersection, and the values of their AVG column are set to 2r. The join column of the last √t tuples is set to some common key v′ that is in the first k tuples, and their AVG column is set to 0. Therefore, the intersection size from the first k tuples is at least t(1 + δ) (or at most t(1 − δ)) if and only if the result of the AVG query is at least 2rt(1+δ)/(t(2+δ)) = (1 + O(δ))r (or at most 2rt(1−δ)/(t(2−δ)) = (1 − O(δ))r). By re-scaling δ by a constant factor, we can get the following theorem:
Theorem 4. Any summary of T1 that can estimate an AVG query
with precision δ with probability at least 2/3 must have a size of at
least Ω(n/(tδ^2)).
4. GENERIC SAMPLING SCHEME
To formally argue about the optimality of a sampling strategy,
we must first define a class of sampling schemes. As discussed in
Section 2.1, there are three well-known sampling operators: stratified, universe, and Bernoulli (uniform). However, these atomic operators can themselves be combined. For example, one can apply
universe sampling of rate 0.1 and then Bernoulli sampling of rate
0.2 for an overall effective sampling rate of 0.02.3 To account for
such hybrid schemes, we define a generic scheme that combines
universe and Bernoulli sampling, called UBS.4 We also define a
more generic scheme that combines all three of stratified, universe
and Bernoulli sampling, called SUBS. It is easy to show that the basic sampling operators are special cases of SUBS. First, we define
the effective sample rate.
Definition 5 (Effective sampling rate). We define the effective sampling rate of a sampling scheme as the expected ratio of the size of
the resulting sample to that of the original table.
Definition 6 (Universe-Bernoulli Sampling (UBS)). Given a table
T and a column (or set of columns) J in T , a UBS scheme is defined
by a pair (p, q), where 0 < p ≤ 1 is a universe sampling rate and 0 < q ≤ 1 is a Bernoulli (or uniform) sampling rate. Let h : U → [0, 1] be a perfect hash function. Then, a sample of T produced by
this scheme, S = UBSp,q (T, J), is produced as follows:
Algorithm 1 UBSp,q (T, J)
S ← ∅
for each tuple t do
if h(t.J) < p then
Include t in S independently w/ prob. q.
end
end
It is easy to see that the effective sampling rate of a UBS scheme
(p, q) is p · q. Thus, the effective sampling rate here is independent
of the actual distribution of the values in the table (and column(s)
J).
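The following is a minimal Python sketch of Algorithm 1 (the row representation and SHA-256-based hash are illustrative assumptions); because the universe and Bernoulli decisions are independent, the expected sample fraction is p·q regardless of the data:

```python
import hashlib
import random

def ubs_sample(rows, join_col, p, q, seed=None):
    """UBS(p, q): universe-sample on the join column, then Bernoulli-sample with rate q."""
    rng = random.Random(seed)
    keep = []
    for r in rows:
        digest = hashlib.sha256(str(r[join_col]).encode()).hexdigest()
        h = int(digest[:15], 16) / 16**15        # pseudo-uniform in [0, 1)
        if h < p and rng.random() < q:           # effective rate is p * q
            keep.append(r)
    return keep
```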
The goal of this sampling paradigm is to optimize the tradeoff between universe sampling and Bernoulli sampling in different instances. At one extreme, when each join value appears exactly once in both tables, universe sampling leads to lower variance
than Bernoulli sampling. This is because independent Bernoulli
sampling has trouble matching tuples with the same join value,
while universe sampling guarantees that when a tuple is sampled,
all matching tuples in the other table are also sampled. At the other
extreme, if all tuples have the same join value in both tables (i.e.,
the join becomes a Cartesian product of the two tables), universe
sampling will either sample the entire join, or sample nothing at
all, while uniform sampling will have a sample size concentrated
around qN , thus giving an estimator of much lower variance. In
Sections 5.1 to 5.3, we give a comprehensive discussion on how to
optimize p and q for different tables and different queries.
The Stratified-Universe-Bernoulli Sampling Scheme applies to
a table T that is divided into K groups (i.e., strata), denoted as
G1 , G2 , ... , Gk .
3 Statistically, it does not matter which sampling is applied first: whether a tuple passes the universe sampler and whether it passes the Bernoulli sampler are completely independent decisions, and hence, the output distribution is the same. Here, we apply universe sampling first only for convenience and without loss of generality.
4 Even if we do not care about output cardinality, universe sampling can still help improve the approximation quality. For example, given two tables of size n with a one-to-one join relationship, the count estimator's variance is n/q^2 under Bernoulli sampling but n/p under universe sampling, which is much lower when p = q.
Definition 7 (Stratified-Universe-Bernoulli Sampling (SUBS)). Given
a table T of N rows and a column (or set of columns) J in T , a
SUBS scheme is defined by a tuple (p1, p2, ..., pK, q1, q2, ..., qK), where 0 < pi, qi ≤ 1 are the universe and Bernoulli sampling rates. Given a perfect hash function h : U → [0, 1], a sample of T produced by this scheme, S = SUBSp1,...,pK,q1,...,qK(T, G, J), is produced as follows:
Algorithm 2 SUBSp1 ,...,pK ,q1 ,...,qK (T, G, J)
S←∅
for each group Gi do
for each tuple t in Gi do
if h(t.J) < pi then
Include t in S independently w/ prob. qi .
end
end
end
Let |Gi| denote the number of tuples in group Gi. Then the effective sampling rate of a SUBS scheme is Σ_i pi · qi · |Gi|/N. We call ǫi = pi · qi the effective sampling rate for group Gi.
In both UBS and SUBS schemes, the user specifies ǫ as their
desired sampling budget, given which our goal is to determine optimal sampling parameters p and q (or pi and qi values) such that
the variance of our join estimator is minimized. In Section 5, we
derive the optimal p and q for UBS. For SUBS, in addition to ǫ,
the user also provides two additional parameters kkey and ktuple
(explained below). Next, we show how to determine the effective
sampling rate ǫi for each group Gi based on these parameters in
SUBS. Given ǫi for each group, the problem is then reduced to
finding the optimal parameters for UBS for that group (i.e., pi and
qi ). Moreover, as we will show in Sections 5.1–5.3, particularly
in Lemma 9, the universe sampling rate for every group must be
the same, and must be the same as the universe sampling rate of
the other table in two-way joins. Hence, we use a single universe
sampling rate p = p1 = ... = pk across all groups.
As mentioned in Section 2.1, ktuple is a user-specified lower
bound on the minimum number of tuples5 in each group the sample
must retain. kkey is an additional user-specified parameter required
for the SUBS scheme. It specifies a threshold at which to activate
the universe sampler. In particular, if a group contains too few (i.e.,
less than kkey ) join keys, we do not perform any universe sampling
as it will have a high chance of filtering out all tuples. Hence, we
apply universe sampling only to those groups with ≥ kkey join
keys. For groups with fewer than kkey join keys, we will only apply Bernoulli sampling with rate ǫi .
We call a group large if it contains at least kkey join keys, otherwise, we call it a small group. We use Nb to denote the total
number of tuples in all large groups, and Ns to denote the total
number of tuples in all small groups. Similarly, let Mb and Ms denote the number of large and small groups, respectively. Then, we
decide the sampling budget ǫi for each group Gi as follows:
1. If Ms ktuple > ǫNs or Mb ktuple > ǫNb , we notify the user that
creating a sample given their parameters is infeasible.
2. Otherwise:
• Let ǫ′s = Ks · ktuple / Ns and let ǫ′′s = ǫ − ǫ′s. Then, for each small group Gi, the sampling budget is ǫi = ktuple/|Gi| + ǫ′′s.
• Let ǫ′b = Kb · ktuple / Nb and let ǫ′′b = ǫ − ǫ′b. Then, for each large group Gi, the sampling budget is ǫi = ktuple/|Gi| + ǫ′′b.
5 The lower bound holds only on average, due to the probabilistic nature of sampling.
Once ǫi is determined for each group, the problem of deciding
optimal SUBS parameters is reduced to deciding the optimal SUBS
parameters for K separate groups. This effective sampling rate ǫi
guarantees that each large group will have at least t tuples in the
sample on average, and the remaining budget is divided evenly.
Thus, the corresponding uniform sampling rate for each large group
is qi = ǫi /p. Moreover, we pose the constraint that the universe
sampling rate p should be at least 1/s to guarantee that, on average,
there is at least one join key passing through the universe sampler.
For small groups, we simply apply uniform sampling with rate
ǫi . This is equivalent to setting p = 1 for these groups.
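A small Python sketch of this per-group budget allocation follows; the representation of groups (a dict mapping a group id to the list of join-key values of its tuples) is an illustrative assumption:

```python
def subs_group_budgets(groups, eps, k_tuple, k_key):
    """groups: {group_id: list of join-key values of its tuples}.
    Returns {group_id: ("small" or "large", eps_i)}; large groups then use
    q_i = eps_i / p for the shared universe rate p, small groups use p = 1."""
    large = {g: v for g, v in groups.items() if len(set(v)) >= k_key}
    small = {g: v for g, v in groups.items() if g not in large}
    N_b = sum(len(v) for v in large.values())
    N_s = sum(len(v) for v in small.values())
    K_b, K_s = len(large), len(small)
    if K_s * k_tuple > eps * N_s or K_b * k_tuple > eps * N_b:
        raise ValueError("infeasible: increase eps or decrease k_tuple")
    eps_s2 = eps - K_s * k_tuple / N_s if N_s else 0.0   # leftover budget, small groups
    eps_b2 = eps - K_b * k_tuple / N_b if N_b else 0.0   # leftover budget, large groups
    budgets = {}
    for g, v in small.items():
        budgets[g] = ("small", k_tuple / len(v) + eps_s2)
    for g, v in large.items():
        budgets[g] = ("large", k_tuple / len(v) + eps_b2)
    return budgets
```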
Overall, this strategy provides the following guarantees:
1. Each group will have at least t tuples in the sample, on average.
2. The probability of each group being missed is at most (1 − 1/s)^s < 0.367. In general, if we set p > c/s for some constant c > 1, this probability becomes 0.367^c.
3. The approximation of the original query will be optimal in terms
of its variance (see Sections 5.1–5.3).
5. OPTIMAL SAMPLING
As shown in Section 4, finding the optimal sampling parameters
within the SUBS scheme can be reduced to finding those within the
UBS scheme. Thus, in this section, we focus on deriving the UBS
parameters that minimize error for each aggregation type (COUNT,
SUM, and AVG). Initially, we also assume there is no WHERE clause.
Later, in Section 6, we show how to handle WHERE conditions and
how to create a single sample instead of creating one per each aggregation type and WHERE condition.
Centralized vs. Decentralized— For each aggregation type, we
analyze two scenarios: centralized and decentralized. Centralized
setting is when the frequencies of the join keys in both tables are
known. This represents situations where both tables are stored on
the same server, or each server communicates its full frequency
statistics to other parties. Decentralized setting is a scenario where
the two tables are each stored on a separate server [63], and exchanging full frequency statistics across the network is costly.6
Decentralized Protocols— In a decentralized setting, each party
(i.e., server) only has access to full statistics of its own table (e.g.,
frequencies, join column distribution). The goal then is for each
party to determine its sampling strategy, while minimizing communications with the other party. Depending on the amount of information exchanged, one can pursue different protocols for achieving this goal. In this paper, we study a simple sampling protocol,
which we call DICTATORSHIP. Here, one server, say party1, is
chosen as the dictator. We also assume that the parties know each
other’s sampling budgets and table sizes (ǫ1 , ǫ2 , |T1 |, and |T2 |).
The dictator observes the distributional information of its own table, say T1 , and decides a shared universe sampling rate p between
max{ǫ1 , ǫ2 } and 1. This p is sent to the other server (party2) and
both servers use p as their universe sampling rate.7 Their uniform
sampling rates will thus be q1 = ǫ1 /p and q2 = ǫ2 /p, respectively.
Since party1 only has T1 ’s frequency information, it chooses
an optimal value of p that minimizes the worst case variance of Jˆagg, i.e., the variance when the frequencies in T2 are chosen adversarially. This can be formulated as a robust optimization [45]:
p∗ = argmin_{max{ǫ1,ǫ2} ≤ p ≤ 1}  max_b Var[Jˆagg]    (1)
where b ranges over all possible frequency vectors of T2. In the rest of this paper, we use DICTATORSHIP in our decentralized analysis (we defer more complex protocols to [35]).
6 Here, we focus on two servers, but the math can easily be generalized to decentralized networks of multiple servers.
7 Using the same universe sampling rate is justified by Lemma 9.
5.1 Join Size Estimation: Count on Joins
We start by considering the following simplified query:
select count(*) from T1 join T2 on J    (2)
where T1 and T2 are two tables joined on column(s) J. Consider S1 = UBS(p1,q1)(T1, J) and S2 = UBS(p2,q2)(T2, J). Then, we can define an unbiased estimator for the above query, Ecount = |T1 ⊲⊳J T2|, using S1 and S2 as follows. Observe that given any pair of tuples t1 ∈ T1 and t2 ∈ T2, where t1.J = t2.J, the probability that (t1, t2) enters S1 ⊲⊳ S2 is pmin q1 q2, where pmin = min{p1, p2}. Hence, the following is an unbiased estimator for Ecount:
Jˆcount(p1, q1, p2, q2, S1, S2) = (1 / (pmin q1 q2)) |S1 ⊲⊳ S2|.
When the arguments p1, q1, p2, q2, S1, S2 are clear from the context, we omit them and simply write Jˆcount.
Lemma 8. Let S1 = UBSp1,q1(T1, J) and S2 = UBSp2,q2(T2, J). The variance of Jˆcount is as follows:
Var(Jˆcount) = ((1 − p)/p) γ2,2 + ((1 − q2)/(p q2)) γ2,1 + ((1 − q1)/(p q1)) γ1,2 + ((1 − q1)(1 − q2)/(p q1 q2)) γ1,1,
where γi,j = Σ_v av^i bv^j.
To minimize Var(Jˆcount) under a fixed sampling budget, the two tables should always use the same universe sampling rate. If p1 > p2, the effective universe sampling rate is only p2, i.e., only a p2 fraction of the join keys inside T1 appear in the join of the samples, and the remaining p1 − p2 fraction is simply wasted. Then, we can change the universe sampling rate of T1 to p2 and increase its uniform sampling rate to obtain a lower variance.
Lemma 9. Given tables T1, T2 joined on column(s) J, a fixed sampling parameter (p1, q1) for T1, and a fixed effective sampling rate ǫ2 for T2, the variance of Jˆcount is minimized when T2 uses p1 as its universe sampling rate and correspondingly ǫ2/p1 as its uniform sampling rate.
Note that Lemma 9 applies to both centralized and decentralized settings, i.e., it applies to any feasible sampling parameters (p1, q1) and (p2, q2), regardless of how the sampling parameters are decided. Next, we analyze each setting.
5.1.1 Centralized Sampling for Count
We have the following result.
Theorem 10. When T1 and T2 use sampling parameters (p, ǫ1/p) and (p, ǫ2/p), Jˆcount's variance is given by:
Var[Jˆcount] = (1/p − 1) γ2,2 + (1/ǫ2 − 1/p) γ2,1 + (1/ǫ1 − 1/p) γ1,2 + (p/(ǫ1 ǫ2) − 1/ǫ1 − 1/ǫ2 + 1/p) γ1,1.
Since each term in Theorem 10 that depends on p is proportional either to p or to 1/p, to find a p that minimizes the variance, one can simply set the first order derivative (with respect to p) to 0.
Theorem 11. Let T1 and T2 be two tables joined on column(s) J. Let av and bv be the frequency of value v in column(s) J of tables T1 and T2, respectively. Given their sampling rates ǫ1 and ǫ2, the optimal sampling parameters (p1, q1) and (p2, q2) are given by:
p1 = p2 = min{1, max{ǫ1, ǫ2, sqrt(ǫ1 ǫ2 (γ2,2 − γ1,2 − γ2,1 + γ1,1) / γ1,1)}}
and q1 = ǫ1/p, q2 = ǫ2/p.
Substituting this into Lemma 8, the resulting variance is within a constant factor of Theorem 2's theoretical limit. For instance, consider a primary-key-foreign-key join query where av ∈ {0, 1}, bv is smaller than some constant, say 5, and ǫ1 = ǫ2 = ǫ for any ǫ. Theorem 11 then chooses p1 = p2 = ǫ, and the variance given by Theorem 10 becomes (1/ǫ − 1)J, where J = Σ_v av bv is the size of the join. Since ǫ is the expected ratio of the sample size to the table size, the expression (1/ǫ − 1)J matches the lower bound in Corollary 3 up to a constant factor.
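The following Python sketch puts the COUNT estimator and Theorem 11 into practice for the centralized setting; the dict-based frequency representation is an illustrative assumption:

```python
from collections import Counter
import math

def count_estimator(s1_keys, s2_keys, p_min, q1, q2):
    """Unbiased estimate of |T1 JOIN T2| from the join keys of the two UBS samples."""
    c1, c2 = Counter(s1_keys), Counter(s2_keys)
    joined = sum(c1[v] * c2[v] for v in c1)          # |S1 JOIN S2|
    return joined / (p_min * q1 * q2)

def gammas(freq_a, freq_b):
    """gamma_{i,j} = sum_v a_v^i * b_v^j (keys missing from either table contribute 0)."""
    g = lambda i, j: sum(a**i * freq_b.get(v, 0)**j for v, a in freq_a.items())
    return {(i, j): g(i, j) for i in (1, 2) for j in (1, 2)}

def optimal_p_count(freq_a, freq_b, eps1, eps2):
    """Theorem 11: p1 = p2 = min{1, max{eps1, eps2, sqrt(...)}}, q_i = eps_i / p."""
    g = gammas(freq_a, freq_b)
    inner = eps1 * eps2 * (g[(2, 2)] - g[(1, 2)] - g[(2, 1)] + g[(1, 1)]) / g[(1, 1)]
    p = min(1.0, max(eps1, eps2, math.sqrt(inner)))
    return p, eps1 / p, eps2 / p   # p, q1, q2
```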
5.1.2 Decentralized Sampling for Count
Motivated by Lemma 9, the DICTATORSHIP protocol uses the
same universe sampling rate p for both parties in the decentralized
setting, by solving the following robust optimization problem:
argmin_{max{ǫ1,ǫ2} ≤ p ≤ 1}  max_b Var[Jˆcount]
Based on Lemma 8 and 11, given the effective sampling rates ǫ1
and ǫ2 , we can express Var[Jˆcount ] as a function of frequencies
{av } and {bv }, and universe sampling rate p as follows.
Var[Jˆcount] = (1/p − 1) γ2,2 + (1/ǫ2 − 1/p) γ2,1 + (1/ǫ1 − 1/p) γ1,2 + (p/(ǫ1 ǫ2) − 1/ǫ1 − 1/ǫ2 + 1/p) γ1,1.    (3)
Lemma 12. Let a∗ be the maximum frequency in table T1, v∗ be any value that has that frequency, and nb be the total number of tuples in T2. The optimal value of max_b Var[Jˆcount] is given by (1/p − 1) a∗^2 nb^2 + (1/ǫ2 − 1/p) a∗^2 nb + (1/ǫ1 − 1/p) a∗ nb^2 + (p/(ǫ1 ǫ2) − 1/ǫ1 − 1/ǫ2 + 1/p) a∗ nb.
In equation (3), given {av } and a fixed p, the variance is a convex
function of the frequency vector {bv }. Thus, the frequency vector
{bv } that maximizes the variance, i.e., the worst case {bv }, is one
where exactly one join key has a non zero frequency. This join
key should be the one with the maximum frequency in T1 . This
is not a representative case and using it to decide a sampling rate
might drastically hinder the performance on average. We therefore
require that both servers also share a simple piece of information
regarding the maximum frequency of the join keys in each table,
say Fa = maxv av and Fb = maxv bv . With this information, the
new optimal sampling rate is given by:
Theorem 13. Given ǫ1 and ǫ2, the optimal UBS parameters (p, q1) and (p, q2) for COUNT in the decentralized setting are given by
p = min{1, max{ǫ1, ǫ2, sqrt(ǫ1 ǫ2 (Fa Fb − Fa − Fb + 1))}}
and q1 = ǫ1/p, q2 = ǫ2/p.
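As a small illustration, the decentralized rule of Theorem 13 only needs the two budgets and each table's maximum join-key frequency (a minimal Python sketch):

```python
import math

def decentralized_p_count(eps1, eps2, F_a, F_b):
    """Shared universe rate p from Theorem 13; note Fa*Fb - Fa - Fb + 1 = (Fa-1)(Fb-1) >= 0."""
    p = min(1.0, max(eps1, eps2,
                     math.sqrt(eps1 * eps2 * (F_a * F_b - F_a - F_b + 1))))
    return p, eps1 / p, eps2 / p   # p, q1, q2
```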
5.2 Sum on Joins
Let Esum be the output of the following simplified query:
select sum(T1.W)
from T1 join T2 on J
Let F be the sum of column W in the joined samples S1 ⊲⊳ S2 .
Then, the following is an unbiased estimator for Esum :
Jˆsum = (1 / (pmin q1 q2)) F    (4)
where pmin = min{p1 , p2 }.
Lemma 14. E[Jˆsum ] = Esum .
Let µv and σv^2 be, respectively, the mean and variance of attribute W among the tuples of T1 that have join value v. Further, recall that av is the number of tuples in T1 with join value v. The following lemma gives the variance of Jˆsum.
Lemma 15. The variance of Jˆsum is given by:
Var[Jˆsum] = ((1 − q2)/(p q2)) β1 + ((1 − q1)/(p q1)) β2 + ((1 − q1)(1 − q2)/(p q1 q2)) β3 + ((1 − p)/p) β4    (5)
where β1 = Σ_v av^2 µv^2 bv, β2 = Σ_v av (µv^2 + σv^2) bv^2, β3 = Σ_v av (µv^2 + σv^2) bv, and β4 = Σ_v av^2 µv^2 bv^2.
Analogous to Lemma 9, we have the following result.
Lemma 16. Given tables T1 , T2 joined on column(s) J, fixed sampling parameters (p1 , q1 ) for T1 , and a fixed effective sampling rate
ǫ2 ≤ p1 for T2 , the variance of Jˆsum is minimized when T2 also
uses p1 as its universe sampling rate and correspondingly, ǫ2 /p1
as its uniform sampling rate.
5.2.1 Centralized Sampling for Sum
Based on Lemma 16, we use the same universe sampling rate
p ≥ ǫ1 , ǫ2 for both tables, with their corresponding uniform sampling rates being q1 = ǫ1 /p and q2 = ǫ2 /p. Then we can further
simplify equation 5 into:
Theorem 17. When T1 and T2 both use the universe sampling rate
p and respectively use the uniform sampling rate q1 = ǫ1 /p and
q2 = ǫ2 /p, the variance of Jˆsum is given by:
Var[Jˆsum] = (1/ǫ2 − 1/p) β1 + (1/ǫ1 − 1/p) β2 + (p/(ǫ1 ǫ2) − 1/ǫ1 − 1/ǫ2 + 1/p) β3 + (1/p − 1) β4.
Theorem 18. Given effective sampling rates ǫ1, ǫ2, the optimal sampling parameters for SUM in a centralized setting are given by
p = min{1, max{ǫ1, ǫ2, sqrt(ǫ1 ǫ2 (β3 + β4 − β1 − β2) / β3)}}, q1 = ǫ1/p, and q2 = ǫ2/p.
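A short Python sketch of the β statistics and the resulting optimal p (per Theorem 18 as stated above); the per-key statistics dictionaries are an illustrative representation:

```python
import math

def betas(stats_a, freq_b):
    """stats_a: {v: (a_v, mu_v, var_v)} for T1; freq_b: {v: b_v} for T2."""
    b1 = b2 = b3 = b4 = 0.0
    for v, (a, mu, var) in stats_a.items():
        b = freq_b.get(v, 0)
        b1 += a**2 * mu**2 * b
        b2 += a * (mu**2 + var) * b**2
        b3 += a * (mu**2 + var) * b
        b4 += a**2 * mu**2 * b**2
    return b1, b2, b3, b4

def optimal_p_sum(stats_a, freq_b, eps1, eps2):
    b1, b2, b3, b4 = betas(stats_a, freq_b)
    inner = eps1 * eps2 * (b3 + b4 - b1 - b2) / b3
    # if inner is negative, the variance is increasing in p and the max{...} clamps p to max{eps1, eps2}
    p = min(1.0, max(eps1, eps2, math.sqrt(max(inner, 0.0))))
    return p, eps1 / p, eps2 / p
```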
5.2.2 Decentralized Sampling for Sum
Lemma 16 implies that, in a decentralized setting for SUM estimation, the universe sampling rate p must be decided by the party
that has T1 , i.e., the table with the aggregate column.
Given a fixed T1 and p, Var[Jˆsum ] is a strictly convex function
of T2 ’s frequency vector. Hence, the worst case instance is a point
distribution where all tuples in T2 share the same join key. However, for SUM, the worst case distributions in T2 are not the same for
all possible sampling parameters p. Define hv (p) to be Var[Jˆsum ]
as a function of p where T2 ’s frequency vector is all concentrated
on the join key v, and define h∗ (p) = maxv hv (p). Since h∗ (p) is
convex and piece-wise quadratic, its minimum can be attained using a sweepline algorithm (see [21, §8] for details). However, the
memory usage is too costly in practice.
Therefore, we propose a simple sampling scheme whose worst
case variance is at most twice the variance of the optimal scheme.
Instead of using h∗ (p) to keep track of the maximum of all hv (p),
we use an approximation h′(p) = max{hv1(p), hv2(p)}, where v1 = argmax_v av^2 µv^2 and v2 = argmax_v av (µv^2 + σv^2), to approximate h∗(p). The function h′ is much simpler and its minimum can
be easily found using quadratic equations and basic case analysis.
For more details on the algorithm, refer to Appendix B in [35].
Let p′ = arg min h′ (p) and p∗ = arg min h∗ (p). We have:
Lemma 19. For any p ≥ ǫ1, ǫ2, we have h∗(p)/2 ≤ h′(p) ≤ h∗(p).
Corollary 20. We have: h∗ (p′ ) ≤ 2h∗ (p∗ ).
5.3 Average on Joins
Let Eavg be the output of the following simplified query:
select avg(T1.W)
from T1 join T2 on J
In general, producing an unbiased estimator for AVG is hard.8 Instead, we define and analyze the following estimator. Let S and
C be the SUM and COUNT of column W in S1 ⊲⊳ S2 . We define
our estimator as Jˆavg = S/C. There are two advantages over using separate samples to evaluate SUM and COUNT: (1) we can use
a larger sample to estimate both queries, and (2) since SUM and
COUNT will be positively correlated, the variance of their ratio will
be lower. Due to the lack of a closed-form expression for the variance of the ratio of two random variables, we next present a first-order bivariate Taylor expansion to approximate the ratio.
Theorem 21. Let S and C be random variables denoting the sum and cardinality of the join of two samples produced by applying UBS sampling parameters (p1, q1) to T1 and (p2, q2) to T2. Let pmin = min{p1, p2}. We have:
Var[S/C] = (E[S]^2 / E[C]^2) (Var[S]/E[S]^2 − 2 Cov[S, C]/(E[S] E[C]) + Var[C]/E[C]^2) + R    (6)
where R is a remainder of lower order terms, and
Cov[S, C] = pmin q1 q2 [ (1 − q2) q1 Σ_v av^2 µv bv + (1 − q1) q2 Σ_v av µv bv^2 + (1 − q1)(1 − q2) Σ_v av µv bv + (1 − pmin) q1 q2 Σ_v av^2 µv bv^2 ],
and the other expectation and variance terms are given by Theorems 10 and 17.
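For concreteness, a minimal Python sketch of computing Jˆavg = S/C on the joined samples (the common scaling factor 1/(pmin q1 q2) cancels in the ratio); the tuple-list representation is an illustrative assumption:

```python
from collections import Counter

def avg_estimator(s1_rows, s2_keys):
    """s1_rows: list of (join_key, w) pairs from S1; s2_keys: list of join keys from S2."""
    c2 = Counter(s2_keys)
    S = sum(w * c2[v] for v, w in s1_rows)   # SUM of W over S1 JOIN S2
    C = sum(c2[v] for v, _ in s1_rows)       # COUNT of S1 JOIN S2
    return S / C if C else float("nan")      # undefined when the sampled join is empty
```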
5.3.1 Centralized Sampling for Average
In a centralized setting where av, bv, µv and σv are given for all v, every term in the expression (E[S]^2/E[C]^2)(Var[S]/E[S]^2 − 2 Cov[S, C]/(E[S] E[C]) + Var[C]/E[C]^2) that depends on p is proportional to either p or 1/p.9
Thus, similar to Theorems 11 and 18, we can again find a p that
minimizes the variance given by Theorem 21. We defer the exact expression of the optimal parameter to [35] due to space constraints.
8 The denominator, i.e., the size of the sampled join, can even be zero. Furthermore, the expectation of a random variable's reciprocal is not equal to the reciprocal of its expectation.
9 Notice that E[S]/E[C] is independent of p.
5.3.2 Decentralized Sampling for Average
Minimizing the worst case variance for AVG (for the decentralized setting) is much more involved than the average case. In most
cases, the objective function (variance) is neither convex nor concave in T2 ’s frequencies. However, note that every term in Theorem 21 is an inner product hx, yi, where x and y are two vectors
stored on party1 and party2, respectively. Fortunately, inner
products can be approximated by transferring a very small amount
of information using the AMS sketch [11, 24]. With such a sketch,
we can derive an approximate sampling rate without communicating the full frequency statistics.
6. MULTIPLE QUERIES AND FILTERS
Creating a separate sample for each combination of aggregation
function, aggregation column, and WHERE clause is clearly impractical. In this section, we show how to create a single sample per join
pattern that supports multiple queries at the cost of some possible
loss of approximation quality. First, we ignore the WHERE clauses
and then show how they can be handled too.
Multiple Tables and Queries— We formulate our input as a graph
G = ⟨V, E⟩. The vertex set V is the set of all (table, join key) pairs, and the edge set E corresponds to all join queries of interest. Specifically, for every join query between tables T1 and T2
on J1 = J2 , we have a corresponding edge e between vertices
(T1 , J1 ) ∈ V and (T2 , J2 ) ∈ V (henceforth, we will use a query
and its corresponding edge interchangeably). This means G is a
multigraph, with potentially parallel edges or self-loops. For each
vertex v = (T, J) ∈ V , we must output a sampling budget ǫv as
well as the corresponding universe sampling rate pv , which will be
used to create a sample S = UBSpv ,ǫv /pv (T, J). This sample will
be used for any query that involves a join with T on column(s) J.
According to Lemmas 8 and 15, and Theorem 21, for each edge e = (v1, v2) ∈ E, we can express the estimator variance of its corresponding query as a function of ǫv1, ǫv2, pv1, pv2 and pe, where pe is an auxiliary variable denoting the minimum of pv1 and pv2:
fe(pe, ǫv1, ǫv2, pv1, pv2) = (1/pe)(Ae + Be · pv1/ǫv1 + Ce · pv2/ǫv2 + De · pv1 pv2/(ǫv1 ǫv2))    (7)
where Ae, Be, Ce, De are constants that depend on the distributional information of the tables in v1 and v2. To cast this as an optimization problem, we also take in a user-specified weight ωe for each edge e and express our objective as:
F = Σ_{e=(v1,v2)∈E} ωe fe(pe, ǫv1, ǫv2, pv1, pv2)    (8)
The choice of ωe values is up to the user. For example, they can
all be set to 1, or to the relative frequency, importance, or probability of appearance (e.g., based on past workloads) of the query
corresponding to e. Then, to find the optimal sampling parameters
we solve the following optimization:
min_{ǫv, pv, pe}  F    subject to    Σ_{v=(T,J)∈V} ǫv · size(T) ≤ B    (9)
where size(T ) is the storage footprint of table T , and B is the
overall storage budget for creating samples. Note that by replacing
the non-linear pe = min(pv1, pv2) constraints with pe ≤ pv1 and
pe ≤ pv2 , (9) is reduced to a smooth optimization problem, which
can be solved numerically with off-the-shelf solvers [14].
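As a concrete illustration, the following sketch solves a one-edge instance of (9) with scipy's SLSQP solver; the solver choice, the constants Ae..De, the table sizes, and the budget are all assumptions for illustration, not values from the paper:

```python
import numpy as np
from scipy.optimize import minimize

A_e, B_e, C_e, D_e = 1.0, 2.0, 2.0, 4.0    # distribution-dependent constants of the edge
size1, size2, budget = 1e6, 2e6, 3e4       # size(T1), size(T2), storage budget B

def objective(x):                          # x = [eps1, eps2, p1, p2, pe], i.e., equation (7)
    eps1, eps2, p1, p2, pe = x
    return (A_e + B_e * p1 / eps1 + C_e * p2 / eps2
            + D_e * p1 * p2 / (eps1 * eps2)) / pe

cons = [
    {"type": "ineq", "fun": lambda x: budget - (x[0] * size1 + x[1] * size2)},  # storage budget
    {"type": "ineq", "fun": lambda x: x[2] - x[4]},   # pe <= p_v1
    {"type": "ineq", "fun": lambda x: x[3] - x[4]},   # pe <= p_v2
    {"type": "ineq", "fun": lambda x: x[2] - x[0]},   # q_v1 = eps_v1 / p_v1 <= 1
    {"type": "ineq", "fun": lambda x: x[3] - x[1]},   # q_v2 = eps_v2 / p_v2 <= 1
]
res = minimize(objective, x0=np.full(5, 0.01), bounds=[(1e-6, 1.0)] * 5,
               constraints=cons, method="SLSQP")
print(res.x)   # eps_v1, eps_v2, p_v1, p_v2, p_e
```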
Known Filters— To incorporate WHERE clauses, we simply regard
a query with a filter c on T1 ⊲⊳ T2 as a query without a filter but on
a sub-table that satisfies c, namely T ′ = σc (T1 ⊲⊳ T2 ).
Unknown Filters with Distributional Information— When the
columns appearing in the WHERE clause can be predicted but the
exact constants are unknown, a similar technique can be applied.
For example, if an equality constraint C = x is anticipated but x
may take on 100 different values, we can conceptually treat it as
100 separate queries, each with a different value of x in its WHERE
clause. This reduces our problem to that of sampling for multiple
queries without a WHERE clause, which we know how to handle
using equation (8).10 Here, the weight ωi can be used to exploit
any distributional information that might be available. In general,
ωi should be set to reflect the probability of each possible WHERE
clause appearing in the future. For example, if there are R possible
WHERE clauses and all are equally likely, we can set ωi = 1/R,
but if popular values in a column are more likely to appear in the
filters, we can use the column’s histogram to assign ωi .
Unknown Filters— When there is no information regarding the
columns (or their values) in future filters, we can take a different
approach. Since the estimator variance is a monotone function in
the frequencies of each join key (see Theorem 10, Theorems 17 and
21), the larger the frequencies, the larger the variance. This means
the worst case variance always happens when the WHERE clause
selects all tuples from the original table. Hence, in the absence
of any distributional information regarding future WHERE clauses,
we can simply focus on the original query without any filters to
minimize our worst case variance.
7. EXPERIMENTS
Our experiments aim to answer the following questions:
(i) How does our optimal sampling compare to other baselines
in centralized and decentralized settings? (§7.2, §7.3)
(ii) How well does our optimal UBS sampling handle join queries
with filters? (§7.4)
(iii) How does our optimal UBS sampling perform when using a
single sample for multiple queries? (§7.5)
(iv) How does our optimal SUBS sampling compare to existing
stratified sampling strategies? (§7.6)
(v) How much does a decentralized setting reduce the resource
consumption and sample creation overhead? (§7.7)
7.1 Experiment Setup
Hardware and Software— We borrowed a cluster of 18 c220g5
nodes from CloudLab [4]. Each node was equipped with an Intel Xeon Silver 4114 processor with 10 cores (2.2Ghz each) and
192GB of RAM. We used Impala 2.12.0 as our backend database
to store data and execute queries.
Datasets— We used several real-life and synthetic datasets:
1. Instacart [1]. This is a real-world dataset from an online grocery. We used their orders and order products tables (3M and
32M tuples, resp.), joined on order id.
2. Movielens [32]. This is a real-world movie rating dataset. We
used their ratings and movies tables (27M and 58K tuples, resp.),
joined on movieid.
3. TPC-H [3]. We used a scale factor of 1000, and joined l_orderkey of the fact table (lineitem, 6B tuples) with o_orderkey of the largest dimension table (orders, 1.5B tuples).
10 Note that, even though each query in this case is on a different
table, they are all sub-tables of the same original table, and hence
their sampling rate p is the same.
Table 2: Six UBS baselines, each with different p and q.
     B1      B2      B3     B4     B5      B6
p    0.001   0.0015  0.003  0.333  0.6667  1.000
q    1.000   0.6667  0.333  0.003  0.0015  0.001
Table 3: Optimal sampling parameters (centralized setting).
                       COUNT            SUM              AVG
Dataset                p      q         p      q         p      q
S{uniform,uniform}     0.010  0.1       0.004  0.264     0.001  1.000
S{uniform,normal}      0.012  0.083     0.005  0.220     0.001  1.000
S{uniform,power1}      1.000  0.001     1.000  0.001     0.692  0.001
S{uniform,power2}      1.000  0.001     1.000  0.001     0.001  1.000
S{normal,uniform}      0.012  0.083     0.009  0.111     0.001  1.000
S{normal,normal}       0.014  0.069     0.011  0.093     0.001  1.000
S{normal,power1}       1.000  0.001     1.000  0.001     0.001  1.000
S{normal,power2}       1.000  0.001     1.000  0.001     0.001  1.000
S{power1,uniform}      1.000  0.001     1.000  0.001     0.001  1.000
S{power1,normal}       1.000  0.001     1.000  0.001     0.001  1.000
S{power1,power1}       1.000  0.001     1.000  0.001     0.001  1.000
S{power2,uniform}      1.000  0.001     1.000  0.001     0.001  1.000
S{power2,normal}       1.000  0.001     1.000  0.001     0.001  1.000
S{power2,power2}       1.000  0.001     1.000  0.001     0.001  1.000
Instacart              0.01   1.00      0.01   1.00      0.01   1.00
Movielens              0.1    1.00      0.1    1.00      0.1    1.00
TPC-H                  0.001  1.00      0.001  1.00      0.001  1.00
4. Synthetic. To better control the join key distribution, we also
generated several synthetic datasets, where tables T1 and T2 each
had 100M tuples and a join column J. T1 had an additional column W for aggregation, drawn from a power law distribution
with range [1, 1000] and α=3.5. We varied the distribution of
the join key in each table to be one of uniform, normal, or power
law, creating different datasets (listed in Table 3). The values of
column J were integers randomly drawn from [1, 10M] according to the chosen distribution. Whenever joining with power2 (see
below), we used 100K join keys in both relations. For normal distribution, we used a truncated distribution with σ=1000/5. We
used two different variants of power law distribution for J, one
with α=1.5 and 10M join keys (referred to as power1), and one
with α=2.0 and 100K join keys (referred to as power2). We denote each synthetic dataset according to its tables’ distributions,
S{distribution of T1 ,distribution of T2 }, e.g., S{uniform,uniform}.
Baselines— We compared our optimal UBS parameters (referred to as OPT) against six baselines. The UBS parameters of these baselines, B1, ..., B6, are listed in Table 2. B1 and B6 are simply universe and uniform sampling, respectively. B2, ..., B5 represent different hybrid variants of these sampling schemes. Sampling budgets were ǫ1 = ǫ2 = 0.001, except for Instacart and Movielens where, due to their small number of tuples, we used 0.01 and 0.1, respectively.
Implementation— We implemented our optimal parameter calculations as a Python application. Our sample generation logic read the required information, such as table sizes and join key frequencies, from the database, and then constructed SQL statements to build the appropriate samples in the target database. We used Python to compute approximate answers from sample-based queries.
Variance Calculations— We generated β=500 pairs of samples for each experiment, and re-ran the queries on each pair, to calculate the variance of our approximations.
7.2 Join Approximation: Centralized Setting
Table 3 shows the sampling rates used by OPT for each dataset and aggregate function in the centralized setting. For Synthetic, the optimal parameters were some mixture of uniform and universe sampling when both tables were only moderately skewed (i.e., uniform or normal distributions) for COUNT and SUM, whereas OPT reduced to simple uniform sampling for power law distributions. This is due to the higher probability of missing extremely popular join keys with universe sampling. To the contrary, for AVG, OPT reduced to simple universe sampling in most cases. This is because maximizing the output size in this case was the best way to reduce variance. For the other datasets (Instacart, Movielens, and TPC-H), the optimal parameters led to universe sampling regardless of aggregate type, since their joins were PK-FK, which makes uniform sampling less useful for the table with primary keys.
Figure 2 shows OPT's improvement over the baselines in terms of variance for COUNT queries. Each bar is also annotated with the relative percentage error of the corresponding baseline. OPT outperformed all baselines in most cases, achieving over 10x lower variance than the worst baseline. Figures 3 and 4 show the same experiment for SUM and AVG. In both cases, OPT achieved the minimum variance across all sampling strategies, except for AVG when T1 or T2 had a power law distribution. This is because OPT for AVG was calculated using a Taylor approximation, which is accurate only when the estimators of SUM and COUNT are both within the proximity of their true values. Moreover, sample variance converges slowly to the theoretical variance, particularly for skewed distributions, such as power law. This is why the estimated variances for OPT were not optimal for some Synthetic datasets. However, OPT still achieved the lowest variance across all real-world datasets, as shown in Figure 5. Here, for the selected join key, OPT determined that full universe sampling was the best sampling scheme.
In summary, this experiment highlights OPT's ability to outperform simple uniform or universe sampling—or to choose one of them, when optimal—for aggregates on joins.
7.3 Join Approximation: Decentralized
We evaluated both OPT and the other baselines under a decentralized setting using the Instacart and Synthetic datasets. Here, we constructed a possible worst case distribution for T2 that was still somewhat realistic, given the distribution of T1 and minimal information about T2 (i.e., T2's cardinality). To do this, we used the following steps: 1) let JMAX(T1) be the most frequent join key value in T1; 2) assign 75% of the join key values of T2 to have the value of JMAX(T1), and draw the rest of the join key values from a uniform distribution.
Figure 6 shows the results. For Synthetic, OPT was the same under both settings whenever there was a power law distribution or the aggregate was AVG. This is because our assumed worst case distribution for T2 was close to a power law distribution. For COUNT and SUM with the Synthetic dataset, OPT in the decentralized setting had a much higher variance than OPT in the centralized setting when there was no power law distribution. With Instacart, OPT in the decentralized setting was the same as OPT in the centralized setting, which had the minimum variance among the baselines. This illustrates that OPT in the decentralized setting can perform well with real-world data where the joins are mostly PK-FK. It also shows that if a reasonable assumption can be made about the distribution of T2, OPT can be as effective in the decentralized setting as it is in a centralized one, while requiring significantly less communication.
Figure 2: OPT's improvement in terms of variance for COUNT over six baselines with synthetic dataset (percentages are relative error).
Figure 3: OPT's improvement in terms of variance for SUM over six baselines with synthetic dataset (percentages are relative error).
Table 4: Optimal sampling parameters for S{uniform,uniform} for different distributions of the filtered column C.

                  COUNT            SUM              AVG
Dist. of C        p      q         p      q         p      q
Uniform           0.010  1.000     0.010  1.000     0.010  1.000
Normal            0.018  0.555     0.015  0.648     0.010  1.000
Power law         0.051  0.195     0.050  0.201     0.010  1.000
Table 5: Sampling parameters (p and q) of OPT using individual samples for different aggregates versus a combined sample (S{normal,normal} dataset).

                      COUNT            SUM              AVG
Scheme                p      q         p      q         p      q
OPT (individual)      0.145  0.069     0.125  0.080     0.010  1.000
OPT (combined)        0.133  0.075     0.133  0.075     0.133  0.075
To study OPT's effectiveness in the presence of filters, we used the S{uniform,uniform} and Instacart datasets with ε = 0.01. We added an extra column C to T1 in S{uniform,uniform}, with integers in [1, 100], and tried three distributions (uniform, normal, and power law). For Instacart, we used the order hour of day column for filtering, which had an almost normal distribution. We used an equality operator and chose the comparison value x uniformly at random, and we calculated the average variance over all possible values of x. Table 4 shows the sampling rates chosen by OPT, while Figure 7 shows OPT's improvement over the baselines in terms of average variance. Again, OPT achieved the lowest average variance among all baselines in all cases, with up to 10x improvement compared to the worst baseline. This experiment confirms that UBS with OPT is highly effective for join approximation, even in the presence of filters.
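To make the estimation step concrete, the sketch below shows one simple way to answer such a filtered COUNT from a pair of UBS samples: a joining pair of tuples survives the shared universe sampling with probability p and the two Bernoulli steps with probability q1 * q2, so a Horvitz-Thompson-style estimate scales the count observed on the sample join by 1 / (p * q1 * q2). The table and column names are hypothetical, and this is an illustration rather than the exact estimator code used in our experiments.

def approx_filtered_count(con, p, q1, q2, hour):
    # Estimate SELECT COUNT(*) FROM T1 JOIN T2 ON T1.jk = T2.jk WHERE T1.hour = <hour>
    # from UBS samples T1_ubs and T2_ubs built with a shared universe rate p and
    # per-table Bernoulli rates q1 and q2 (sqlite3-style DB-API connection assumed).
    sql = """
        SELECT COUNT(*)
        FROM T1_ubs s1 JOIN T2_ubs s2 ON s1.jk = s2.jk
        WHERE s1.hour = ?;
    """
    raw = con.execute(sql, (hour,)).fetchone()[0]
    # Horvitz-Thompson scale-up: each joining pair appears with probability p*q1*q2
    return raw / (p * q1 * q2)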
7.5  Combining Samples
We evaluated the idea of using a single sample for multiple queries, instead of generating an individual sample for each query, as discussed in Section 6. Here, we use OPT (individual) and OPT (combined) to denote one sample per query and one sample for multiple queries, respectively. For OPT (combined), we considered a scenario where each of COUNT, SUM, and AVG is equally likely to appear. Table 5 reports the sampling rates chosen in each case. As shown in Figure 8, without having to generate an individual sample for each query, the variances of OPT (combined) were only slightly higher than those of OPT (individual). This experiment shows that it is possible to create a single sample for multiple queries without sacrificing too much optimality.
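The following sketch illustrates how a combined sample's parameters could be chosen under this equal-likelihood assumption. It assumes the per-aggregate variance functions (as functions of p and q) are available as Python callables, and it uses the fact that the storage budget fixes the product p * q (the products in Tables 4 and 5 are all approximately ε); the names and the simple grid search are illustrative, not our actual optimization code.

import numpy as np

def combined_parameters(v_count, v_sum, v_avg, eps=0.01, grid=1000):
    # Each aggregate is assumed equally likely to appear, so we minimize the
    # average of the three variance functions subject to the budget p * q = eps.
    best = None
    for p in np.linspace(eps, 1.0, grid):
        q = eps / p
        cost = (v_count(p, q) + v_sum(p, q) + v_avg(p, q)) / 3.0
        if best is None or cost < best[0]:
            best = (cost, p, q)
    return best[1], best[2]   # the (p, q) pair minimizing the average variance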
7.6  Stratified Sampling
We also evaluated SUBS for join queries with group-by. Here, we used the S{normal,normal} dataset and added an extra group column G to T1, with integers from 0 to 9 drawn from a power law distribution with α = 1.5. This time we did not randomize the groups, i.e., G=0 had the most tuples and G=9 had the fewest; this was to study SUBS's performance with respect to the different group sizes. As a baseline, we generated stratified samples for T1 on G with kkey = 100,000 and uniform samples for T2 with a 0.01 sampling budget. We denote this baseline as SS_UF. For SUBS, we used parameters that matched the sample size of SS_UF, i.e., kkey = 100, ktuple = 100,000. Figure 9 shows the variance of the query estimators for each of the 10 groups for different aggregations. As expected, SUBS with OPT achieved lower variances than SS_UF across all aggregates and group sizes.
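For reference, one way to generate such a skewed group column (illustrative code, not necessarily our data generator) is to draw G from probabilities proportional to (g + 1)^(-1.5) over g = 0, ..., 9, so that G = 0 is the largest group and G = 9 the smallest:

import numpy as np

def powerlaw_groups(n, num_groups=10, alpha=1.5, seed=0):
    # Group g is drawn with probability proportional to (g + 1) ** (-alpha),
    # so smaller group ids occur more often (G = 0 largest, G = 9 smallest).
    weights = np.arange(1, num_groups + 1, dtype=float) ** (-alpha)
    rng = np.random.default_rng(seed)
    return rng.choice(num_groups, size=n, p=weights / weights.sum())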
7.7  Overhead: Centralized vs. Decentralized
We compared the overhead of OPT in centralized versus decentralized settings, in terms of the sample creation time and resources,
such as network and disk. OPT should have a much higher overhead
in the centralized setting, as it requires full frequency information
of every join key value in both tables. To quantify their overhead
difference, we used Instacart and TPC-H, and created a pair
of samples for SUM in each case. Here, the aggregation type did
not matter, as the time spent calculating p and q was negligible compared to the time taken to transmit the frequency vectors.
As shown in Figure 10, we measured the time for statistics acquisition, sampling rate calculation, and sample table creation. Here, the time taken to collect the frequencies was the dominant factor. For Instacart, the whole process took 65.16 seconds in the decentralized setting, compared to 99.98 seconds in the centralized setting, a 1.53x improvement. For TPC-H, it took 59.5 minutes in the decentralized setting, compared to 91.7 minutes in the centralized setting, a speedup of 1.54x.
We also measured the total network and disk I/O usage across
the entire cluster, as shown in Figure 11. For Instacart, compared to the decentralized setting, the centralized one used 3.66x
(0.9 → 3.29 MB) more network and 2.22x (7.59 → 16.9 MB)
more disk bandwidth. Overall, the relative overhead was smaller for TPC-H: the centralized setting in this case used 1.38x (243.39 → 337.04 MB) more network and 1.49x (519.03 → 776.58 MB) more disk bandwidth than the decentralized setting.
Figure 4: OPT's improvement in terms of variance for AVG over six baselines with synthetic dataset (percentages are relative error).
Figure 5: OPT's improvement in terms of variance over the baselines on benchmark datasets: (a) Instacart, (b) Movielens, (c) TPC-H (percentages are relative error).
Figure 6: Variances of the query estimators for OPT in the centralized and decentralized settings for (a) COUNT, (b) SUM, and (c) AVG.
Figure 7: OPT's improvement in terms of the estimator's variance over six baselines in the presence of filters: (a) COUNT, (b) SUM, (c) AVG, and (d) Instacart.
Figure 8: Variance of the query estimators for OPT (individual) and OPT (combined) for the S{normal,normal} dataset.
This experiment shows the graceful tradeoff between the optimality of sampling and its overhead, making the decentralized variant an attractive choice for large datasets and distributed systems.
8. RELATED WORK
Online Sample-based Join Approximation— Ripple Join [29,43]
is an online join algorithm that operates under the assumption that
the tuples of the original tables are processed in a random order.
Each time, it retrieves a random tuple (or a set of random tuples)
from the tables, and then joins the new tuples with the previously
read tuples and with each other. SMS [36] speeds up the hashed
version of Ripple Join when hash tables exceed memory. Wander
Join [41] tackles the problem of k-way chain join and eliminates
the random order requirement of Ripple Join. However, it requires
an index on every join column in each of the tables. Using indexes, Wander Join performs a set of random walks and obtains a
non-uniform but independent sample of the join. Maintaining an
approximation of the size of all partial joins can help overcome the
non-uniformity problem [42, 65].
Offline Sample-based Join Approximation— AQUA [5] acknowledges the quadratic reduction and the non-uniformity of the output
when joining two uniform random samples. The same authors propose Join Synopsis [6], which computes a sample of one of the
tables and joins it with the other tables as a sample of the actual
join. Chaudhuri et al. [17] also point out that a join of independent
samples from two relations does not yield an independent sample
of their join, and propose using precomputed statistics to overcome
this problem. However, their solution can be quite costly, as it requires collecting full frequency information of the relation. Zhao et al. [65] provide a better trade-off between sampling efficiency and the join size upper bound.
Figure 9: Query estimator variance per group for a group-by join aggregate using SUBS versus SS_UF: (a) COUNT, (b) SUM, and (c) AVG.
Figure 10: Time taken to generate samples for Instacart and TPC-H in the centralized vs. decentralized setting.
Figure 11: Total network and disk bandwidth used to generate samples for Instacart and TPC-H.
Hashed sampling (a.k.a. universe sampling) [31]
is proposed in the context of selectivity estimation for set similarity queries. Block-level uniform sampling [16] is less accurate but
more efficient than tuple-level sampling. Bi-level sampling [19,30]
performs Bernoulli sampling at both the block- and tuple-level, as a
trade-off between accuracy and the I/O cost of sample generation. Kamat and Nandi [37] use simple stratified sampling on the join column
but with an objective function measuring the amount of randomness of the sample scheme, which shows improvement over simple
correlated sampling.
AQP Systems on Join— Most AQP systems rely on sampling and
support certain types of joins [5,8,15,27,38,42,51,55]. STRAT [15]
discusses the use of uniform and stratified sampling, and how those
can support certain types of join queries. More specifically, STRAT
only supports PK-FK joins between a fact table and one or more dimension table(s). BlinkDB [8] extends STRAT and considers multiple stratified samples instead of a single one. As previously mentioned, AQUA [5] supports foreign key joins using join synopses.
Icicles [27] samples tuples that are more likely to be required by
future queries, but, similar to AQUA, only supports foreign key
joins. PF-OLA [55] is a framework for parallel online aggregation. It studies parallel joins with group-bys, when partitions of
the two tables fit in memory. XDB [42] integrates Wander Join
in PostgreSQL. Quickr [38] does not create offline samples. Instead, it uses universe sampling to support equi-joins, where the
group-by columns and the value of aggregates are not correlated
with the join keys. VerdictDB [51] is a universal AQP framework that supports all three types of samples (uniform, universe,
and stratified). VerdictDB utilizes a technique called variational
subsampling, which creates subsamples of the sample such that it
only requires a single join—instead of repeatedly joining the subsamples multiple times—to produce accurate aggregate approximations. ApproxJoin [56] uses Bloom Filters in conjunction with
stratified sampling to efficiently produce a sample of the join when
relations are distributed across different nodes.
Join Cardinality Estimation— There is extensive work on join
cardinality estimation (i.e., COUNT(*)) in the database community [10, 26, 39, 40, 54, 59, 60, 62], as an important step of the query
optimization process for joins. Two-level sampling [18] first applies universe sampling to the join values, and then, for each join
value sampled, it performs Bernoulli sampling. However, unlike
our UBS scheme which applies the same rate to all keys, two-level
sampling uses a different rate during its universe sampling for each
join key. In other words, two-level sampling is a more complex
scheme with significantly more parameters than UBS (which requires only two parameters, p and q), and is thus less amenable to
efficient and decentralized implementation. Furthermore, two-level
sampling applies two different sampling methods, whereas bi-level
sampling [30] uses only Bernoulli sampling but at different granularity levels. End-biased sampling [26] samples each tuple with
a probability proportional to the frequency of its join key. Index-based sampling [40] and deep learning [39] have also been utilized
to improve cardinality estimates.
Theoretical Studies— To the best of our knowledge, the question of the limitations of sample-based approximation of joins has not been asked in the theory community. However, past work in communication complexity on set intersection and inner product estimation has implications for join approximation. In this problem, Alice and Bob possess two vectors x and y, respectively, and wish to compute the inner product t = ⟨x, y⟩ without exchanging the vectors x and y. In the one-way model, Alice computes a summary β(x) and sends it to Bob, who estimates ⟨x, y⟩ using y and β(x). For this problem, [48] shows that any estimator produced from s bits of communication has variance at least Ω(dt/s). Estimating the inner product of 0-1 vectors is directly related to estimating SUM and COUNT for a PK-FK join. A natural question is whether the join is still hard even if all frequencies are larger than 1. Further, the question of whether estimating AVG is also hard is not answered by prior work.
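For clarity, the cited one-way communication bound can be restated in display form as follows (here d denotes the length of the vectors, s the size of Alice's summary in bits, and t the true inner product; this is a restatement of the prose above, not a new result):

\[
x, y \in \{0,1\}^d, \qquad t = \langle x, y \rangle, \qquad \hat{t} = \hat{t}\big(\beta(x),\, y\big), \qquad |\beta(x)| \le s \ \text{bits},
\]
\[
\operatorname{Var}\!\big[\hat{t}\,\big] \;\ge\; \Omega\!\left(\frac{d\, t}{s}\right) \quad \text{for any such estimator [48].}
\]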
9. CONCLUSION
Our goal in this paper was to improve our understanding of join
approximation using offline samples, and formally address some of
the key open questions faced by practitioners using and building
AQP engines. We defined generic sampling schemes that cover the
most common sampling strategies, as well as their combinations.
Within these schemes, we (1) provided an information-theoretical
lower bound on the lowest error achievable by any offline sampling scheme, (2) derived optimal strategies that match this lower
bound within a constant factor, and (3) offered a decentralized variant that requires minimal communication of statistics across the
network. These results allow practitioners to quickly determine—
e.g., based on the distribution of the join columns—if joining offline samples will be futile or will yield a reasonable accuracy. We
also expect our hybrid samples to improve the accuracy of database
learning [52] and selectivity estimation [53] for join queries.
10. ACKNOWLEDGEMENT
This research is in part supported by the National Science Foundation through grants 1553169 and 1629397.
11. REFERENCES
[1] The instacart online grocery shopping dataset 2017.
https://www.instacart.com/datasets/
grocery-shopping-2017. Accessed: 2019-07-20.
[2] Security on-demand announces acquisition of Infobright
analytics & technology assets.
https://tinyurl.com/y6ctn4vs.
[3] TPC-H Benchmark. http://www.tpc.org/tpch/.
[4] Cloudlab. https://www.cloudlab.us, 2019.
[5] S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A fast
decision support system using approximate query answers.
In VLDB, 1999.
[6] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy.
Join synopses for approximate query answering. In
SIGMOD, 1999.
[7] Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet
Talwalkar, Michael Jordan, Samuel Madden, Barzan
Mozafari, and Ion Stoica. Knowing when you’re wrong:
Building fast and reliable approximate query processing
systems. In SIGMOD, 2014.
[8] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry
Milner, Samuel Madden, and Ion Stoica. BlinkDB: queries
with bounded errors and bounded response times on very
large data. In EuroSys, 2013.
[9] Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P.
Iyer, Samuel Madden, and Ion Stoica. Blink and it’s done:
Interactive queries on very large data. PVLDB, 2012.
[10] Noga Alon, Phillip B Gibbons, Yossi Matias, and Mario
Szegedy. Tracking join and self-join sizes in limited storage.
Journal of Computer and System Sciences, 64, 2002.
[11] Noga Alon, Yossi Matias, and Mario Szegedy. The space
complexity of approximating the frequency moments. J.
Comput. Syst. Sci., 58, 1999.
[12] Albert Atserias, Martin Grohe, and Dániel Marx. Size
bounds and query plans for relational joins. SIAM J.
Comput., 42(4), 2013.
[13] Brian Babcock, Surajit Chaudhuri, and Gautam Das.
Dynamic sample selection for approximate query processing.
In VLDB, 2003.
[14] Stephen P. Boyd and Lieven Vandenberghe. Convex
Optimization. Cambridge University Press, 2014.
[15] Surajit Chaudhuri, Gautam Das, and Vivek Narasayya.
Optimized stratified sampling for approximate query
processing. TODS, 2007.
[16] Surajit Chaudhuri, Gautam Das, and Utkarsh Srivastava.
Effective use of block-level sampling in statistics estimation.
In SIGMOD, 2004.
[17] Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya.
On random sampling over joins. In SIGMOD, 1999.
[18] Yu Chen and Ke Yi. Two-level sampling for join size
estimation. In SIGMOD, 2017.
[19] Yu Cheng, Weijie Zhao, and Florin Rusu. Bi-level online
aggregation on raw data. In SSDBM, 2017.
[20] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M.
Hellerstein, Khaled Elmeleegy, and Russell Sears.
Mapreduce online. In NSDI, 2010.
[21] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest,
and Clifford Stein. Introduction to algorithms. MIT press,
2009.
[22] Graham Cormode, Minos Garofalakis, Peter J Haas, and
Chris Jermaine. Synopses for massive data: Samples,
histograms, wavelets, sketches. Foundations and Trends in
Databases, 4, 2012.
[23] Andrew Crotty, Alex Galakatos, Emanuel Zgraggen, Carsten
Binnig, and Tim Kraska. Vizdom: Interactive analytics
through pen and touch. PVLDB, 2015.
[24] Alin Dobra, Minos N. Garofalakis, Johannes Gehrke, and
Rajeev Rastogi. Processing complex aggregate queries over
data streams. In SIGMOD, 2002.
[25] Alin Dobra, Chris Jermaine, Florin Rusu, and Fei Xu.
Turbo-charging estimate convergence in dbo. PVLDB, 2009.
[26] Cristian Estan and Jeffrey F. Naughton. End-biased samples
for join cardinality estimation. In ICDE, 2006.
[27] Venkatesh Ganti, Mong-Li Lee, and Raghu Ramakrishnan.
Icicles: Self-tuning samples for approximate query
answering. In VLDB, 2000.
[28] Deepak Goyal. Approximate query processing at
WalmartLabs.
https://fifthelephant.talkfunnel.com/
2018/43-approximate-query-processing.
[29] Peter J. Haas and Joseph M. Hellerstein. Ripple Joins for
Online Aggregation. In SIGMOD, pages 287–298, 1999.
[30] Peter J Haas and Christian König. A bi-level bernoulli
scheme for database sampling. In SIGMOD, 2004.
[31] Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, and
Divesh Srivastava. Hashed samples: selectivity estimators for
set similarity selection queries. PVLDB, 2008.
[32] F Maxwell Harper and Joseph A Konstan. The movielens
datasets: History and context. TIIS, 2016.
[33] Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang.
Online aggregation. In SIGMOD, 1997.
[34] Daniel G Horvitz and Donovan J Thompson. A
generalization of sampling without replacement from a finite
universe. Journal of the American statistical Association, 47,
1952.
[35] Dawei Huang, Dong Young Yoon, Seth Pettie, and Barzan
Mozafari. Joins on samples: A theoretical guide for
practitioners. https://arxiv.org/abs/1912.03443,
2019.
[36] Christopher Jermaine, Alin Dobra, Subramanian Arumugam,
Shantanu Joshi, and Abhijit Pol. A disk-based join with
probabilistic guarantees. In SIGMOD, 2005.
[37] Niranjan Kamat and Arnab Nandi. A unified
correlation-based approach to sampling over joins. In
Proceedings of the 29th International Conference on
Scientific and Statistical Database Management, Chicago,
IL, USA, June 27-29, 2017, pages 20:1–20:12, 2017.
[38] Srikanth Kandula, Anil Shanbhag, Aleksandar Vitorovic,
Matthaios Olma, Robert Grandl, Surajit Chaudhuri, and
Bolin Ding. Quickr: Lazily approximating complex adhoc
queries in bigdata clusters. In SIGMOD, 2016.
[39] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis,
Peter Boncz, and Alfons Kemper. Learned cardinalities:
Estimating correlated joins with deep learning.
arXiv:1809.00677, 2018.
[40] Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons
Kemper, and Thomas Neumann. Cardinality estimation done
right: Index-based join sampling. In CIDR, 2017.
[41] Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join:
Online aggregation via random walks. In SIGMOD, 2016.
[42] Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join
and XDB: online aggregation via random walks. TODS,
2019.
[43] Gang Luo, Curt J Ellmann, Peter J Haas, and Jeffrey F
Naughton. A scalable hash ripple join algorithm. In
SIGMOD, 2002.
[44] Barzan Mozafari. Approximate query engines: Commercial
challenges and research opportunities. In SIGMOD Keynote,
2017.
[45] Barzan Mozafari, Eugene Zhen Ye Goh, and Dong Young
Yoon. CliffGuard: A principled framework for finding robust
database designs. In SIGMOD, 2015.
[46] Barzan Mozafari and Ning Niu. A handbook for building an
approximate query engine. IEEE Data Eng. Bull., 2015.
[47] Barzan Mozafari, Jags Ramnarayan, Sudhir Menon, Yogesh
Mahajan, Soubhik Chakraborty, Hemant Bhanawat, and
Kishor Bachhav. SnappyData: A unified cluster for
streaming, transactions, and interactive analytics. In CIDR,
2017.
[48] Rasmus Pagh, Morten Stöckel, and David P. Woodruff. Is
min-wise hashing optimal for summarizing set intersection?
In PODS, 2014.
[49] Niketan Pansare, Vinayak R. Borkar, Chris Jermaine, and
Tyson Condie. Online aggregation for large mapreduce jobs.
PVLDB, 4, 2011.
[50] Yongjoo Park, Michael Cafarella, and Barzan Mozafari.
Visualization-aware sampling for very large databases.
ICDE, 2016.
[51] Yongjoo Park, Barzan Mozafari, Joseph Sorenson, and
Junhao Wang. VerdictDB: universalizing approximate query
processing. In SIGMOD, 2018.
[52] Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, and
Barzan Mozafari. Database Learning: Towards a database
that becomes smarter every time. In SIGMOD, 2017.
[53] Yongjoo Park, Shucheng Zhong, and Barzan Mozafari.
QuickSel: Quick selectivity learning with mixture models.
CoRR, abs/1812.10568, 2018.
[54] Theoni Pitoura and Peter Triantafillou. Self-join size
estimation in large-scale distributed data systems. In ICDE,
2008.
[55] Chengjie Qin and Florin Rusu. Pf-ola: a high-performance
framework for parallel online aggregation. Distributed and
Parallel Databases, 2013.
[56] Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros
Blanas, Ruichuan Chen, Christof Fetzer, and Thorsten
Strufe. Approxjoin: Approximate distributed joins. In
Proceedings of the ACM Symposium on Cloud Computing,
SoCC 2018, Carlsbad, CA, USA, October 11-13, 2018, pages
426–438, 2018.
[57] Sajjadur Rahman, Maryam Aliakbarpour, Hidy Kong, Eric
Blais, Karrie Karahalios, Aditya G. Parameswaran, and
Ronitt Rubinfeld. I’ve seen ”enough”: Incrementally
improving visualizations to support rapid decision making.
PVLDB, 2017.
[58] Hong Su, Mohamed Zait, Vladimir Barrière, Joseph Torres,
and Andre Menck. Approximate aggregates in oracle 12c,
2016.
[59] Arun Swami and K Bernhard Schiefer. On the estimation of
join result sizes. In EDBT, 1994.
[60] David Vengerov, Andre Cavalheiro Menck, Mohamed Zaït,
and Sunil Chakkappen. Join size estimation subject to filter
conditions. PVLDB, 2015.
[61] Sai Wu, Beng Chin Ooi, and Kian-Lee Tan. Continuous
sampling for online aggregation over multiple queries. In
SIGMOD, 2010.
[62] Wentao Wu, Jeffrey F Naughton, and Harneet Singh.
Sampling-based query re-optimization. In SIGMOD, 2016.
[63] Dong Young Yoon, Mosharaf Chowdhury, and Barzan
Mozafari. Distributed lock management with rdma:
Decentralization without starvation. In SIGMOD, 2018.
[64] Kai Zeng, Shi Gao, Barzan Mozafari, and Carlo Zaniolo. The
analytical bootstrap: a new method for fast error estimation
in approximate query processing. In SIGMOD, 2014.
[65] Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and
Ke Yi. Random sampling over joins revisited. In SIGMOD,
2018.