
Subsampling Suffices for Adaptive Data Analysis

Guy Blanc
Stanford University

September 22, 2023


arXiv:2302.08661v2 [cs.LG] 20 Sep 2023

Abstract
Ensuring that analyses performed on a dataset are representative of the entire population
is one of the central problems in statistics. Most classical techniques assume that the dataset
is independent of the analyst’s query and break down in the common setting where a dataset
is reused for multiple, adaptively chosen, queries. This problem of adaptive data analysis was
formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS,
2014).
We identify a remarkably simple set of assumptions under which the queries will continue to
be representative even when chosen adaptively: The only requirements are that each query takes
as input a random subsample and outputs few bits. This result shows that the noise inherent in
subsampling is sufficient to guarantee that query responses generalize. The simplicity of this
subsampling-based framework allows it to model a variety of real-world scenarios not covered by
prior work.
In addition to its simplicity, we demonstrate the utility of this framework by designing
mechanisms for two foundational tasks, statistical queries and median finding. In particular, our
mechanism for answering the broadly applicable class of statistical queries is both extremely
simple and state of the art in many parameter regimes.
Contents
1 Introduction 2
1.1 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 A mechanism for statistical queries . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 A mechanism for finding approximate medians . . . . . . . . . . . . . . . . . 6
1.3 Other related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Technical overview 7
2.1 Intuition behind the mutual information bound . . . . . . . . . . . . . . . . . . . . . 7
2.2 Average leave-many-out KL stability . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Concentration of U-statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Bounding the bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 A self-reduction for boosting success probability . . . . . . . . . . . . . . . . . . . . . 12

3 Improvements upon the conference version 14

4 Notation and key definitions 14


4.1 Formal model of the analyst . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 Deterministic vs randomized analysts . . . . . . . . . . . . . . . . . . . . . . 16
4.1.2 Deterministic vs randomized subsampling queries . . . . . . . . . . . . . . . . 17

5 Average leave-many-out KL stability and its connection to mutual information 17


5.1 Bounding the drop in m-conditional entropy using ALMOKL stability . . . . . . . . 19
5.2 Bounding mutual information using m-conditional entropy . . . . . . . . . . . . . . . 20

6 Bounding the stability of subsampling queries 22


6.1 Improving Hoeffding’s reduction theorem . . . . . . . . . . . . . . . . . . . . . . . . 25

7 From small mutual information to small bias 26

8 Boosting the success probability 29


8.1 Proof of Lemma 8.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
8.2 Proof of Lemma 8.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
8.3 Proof of Theorem 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

9 Applications 36
9.1 The statistical-query mechanism: Proof of Theorem 3 . . . . . . . . . . . . . . . . . 36
9.2 The median-finding mechanism: Proof of Theorem 4 . . . . . . . . . . . . . . . . . . 39

10 Acknowledgments 40

A The average leave-one-out KL stability of subsampling queries 42

1 Introduction
Data is a scarce and valuable resource. As a result, data analysts often reuse the same dataset
to answer multiple queries. Hardt and Ullman [HU14] as well as Dwork, Feldman, Hardt, Pitassi,
Reingold, and Roth [DFH+ 15c] initiated the study of adaptive data analysis which aims to give
provable guarantees that the query answers will have low bias, i.e. be representative of the full
population, even when a dataset is reused adaptively. Since then, there have been a number of
works exploring the adaptive reuse of data [DFH+ 15a, DFH+ 15b, SU15b, SU15a, BNS+ 16, RZ16,
RRST16, Smi17, FS17, FS18, FRR20, DK22].
Much prior work has focused on the design of mechanisms, a layer between the analyst and
dataset. In those works, analysts do not have direct access to the data. Instead, when they wish to
ask a query φ, they pass it to the mechanism. The mechanism answers φ without revealing too
much information about the dataset, e.g. by adding noise to the output of φ applied to the dataset
[DFH+15a, DFH+15b, DK22, FS18, FS17, SU15a].
This work is motivated by a simple question.

How can we guarantee that the query responses will have low bias, even without an
explicit mechanism?

The purpose of asking this question is twofold. First, it models a reality in which data analysts
often do not explicitly use mechanisms to obfuscate query results before looking at them. Second,
if the assumptions we make on the queries are sufficiently easy to explain to an analyst, they are
actionable, as the analyst can keep these assumptions in mind when deciding how to analyze data.

Our approach. We show that as long as each query takes as input a random subsample of the
dataset and outputs to a small range, the results will have low bias. Quantitatively, our results
depend only on the size of the subsample and the number of possible outputs the query has. These
quantities are intuitive and easy to explain to a data analyst who may be interested in ensuring their
results have low bias. Furthermore, many algorithms, such as stochastic gradient descent, already
subsample their data, in which case we guarantee generalization for free.
One interpretation of this framework is that it eliminates the need to design a noise distribution
for each task. Prior works design mechanisms to bound the bias by adding an appropriate amount
of noise to the true result before returning it (e.g. by adding a mean-0 Gaussian). Our work shows
that the noise inherent in subsampling suffices. It also extends to tasks where it is difficult to design
an appropriate noise distribution – for example, when the output of each query is categorical rather
than numerical.
As easy corollaries of this subsampling approach, we give simple mechanisms for two foundational
tasks, statistical queries and median finding, demonstrating the power of this framework. In
particular, our mechanism for answering the broad and influential class of statistical queries (SQs)
[Kea98, FGR+ 17] achieves state of the art accuracy in many parameter regimes. In addition to
their broad applicability, statistical queries have been the standard bearer by which we assess the
utility of approaches for adaptive data analysis since the early works of [DFH+ 15c, BNS+ 16]. Our
SQ mechanism has advantages beyond its accuracy: It runs in sublinear time, and its extreme
simplicity renders it broadly applicable in non-standard settings.

1.1 Main results
We show that an analyst can ask an adaptive sequence of subsampling queries without incurring
large bias.

Definition 1 (Subsampling query). For any sample S ∈ X^n and function φ : X^w → Y, the
subsampling query φ is answered by drawing x_1, . . . , x_w uniformly without replacement from S, and
then providing the response y = φ(x_1, . . . , x_w).
In an abuse of notation, we use φ(S) to denote the distribution of y defined above. Similarly, for
any distribution D supported on X, we use φ(D) to denote the distribution of y′ = φ(x′_1, . . . , x′_w)
when x′_1, . . . , x′_w are drawn i.i.d. from D.
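As a concrete reading of Definition 1, answering a subsampling query can be sketched in a few lines of Python (the helper name and the example query below are ours, not the paper’s):

```python
import random

def answer_subsampling_query(S, phi, w, rng=None):
    """Answer the subsampling query phi on the sample S (Definition 1):
    draw w elements of S uniformly WITHOUT replacement, then apply phi.
    A minimal sketch; the function name is ours, not the paper's."""
    rng = rng or random.Random()
    x = rng.sample(S, w)          # uniform draw without replacement
    return phi(*x)

# A width-2 query with two possible outputs, so w = 2 and |Y| = 2.
S = list(range(100))
phi = lambda a, b: int(a + b >= 99)
y = answer_subsampling_query(S, phi, w=2, rng=random.Random(0))
assert y in (0, 1)
```

Note that y is a random variable: re-answering the same query on the same sample generally gives a fresh draw from φ(S).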

We allow the analyst’s choice of queries to be adaptive: Formally, at each time step t ∈ [T ] :=
{1, 2, . . . , T }, the analyst selects a query φ_t : X^{w_t} → Y_t which may depend on the previous responses
y1 , . . . , yt−1 and then receives the response yt ∼ φt (S). We begin with an informal result bounding
the sample size needed to ensure the results have low bias.

Theorem 1 (Subsampling responses have low bias). Suppose an analyst asks an adaptive
sequence of T subsampling queries, each mapping X^w to Y, to a sample S ∼ D^n. As long as

n ≥ Ω( w√(T|Y|) ),

with high probability, all of the queries will have low bias.

Compare Theorem 1 to a naive approach which takes a fresh batch of w samples for each query.
Subsampling has a quadratically better dependence on the number of queries asked, T , than that
naive approach which requires n ≥ wT .
Informally, a subsampling query φ has low bias if the distributions φ(S) and φ(D) are “close.”
We’ll formalize and generalize Theorem 1 in Section 2.4. That generalization will, for example, allow
the analyst to choose a different domain size and range for each query.
The main ingredient in the proof of Theorem 1 is a bound on the mutual information between
the sample and the query responses.

Theorem 2 (Subsampling queries reveal little information). Suppose an analyst asks an adaptive
sequence of T subsampling queries to a sample S ∼ D^n where the t-th query takes as input a
subsample of size w_t and outputs to a range of size r_t. Then, the mutual information between the
query responses and the sample is at most

I(S; (y_1, . . . , y_T)) ≤ 4 E[ Σ_{t=1}^{T} w_t(r_t − 1) ] / n

where y_1, . . . , y_T are the query responses and the expectation is over the analyst’s choices for the
subsampling queries.

1.2 Applications
1.2.1 A mechanism for statistical queries
Our main application is an extremely simple and accurate mechanism for the broad class of
statistical queries. Statistical queries, introduced by Kearns [Kea98], are parameterized by a function
φ : X → [0, 1]. A valid answer to such a query is any value close to µφ (D) := Ex∼D [φ(x)]. Many
natural analyses can be cast as a sequence of statistical queries. This includes most algorithms for
both supervised and unsupervised learning, such as least-squares regression, gradient-descent, and
moment based methods. We refer the reader to [FGR+ 17] for a more extensive list.

Input: A sample S ∈ X^n and parameter k.

For each time step t ∈ [T] := {1, 2, . . . , T},
1. Receive a query φ_t : X → [0, 1].
2. Sample x_1, . . . , x_k i.i.d. from Unif(S) and, for each, sample a vote v_i ∼ Bernoulli(φ_t(x_i)).
3. Output y_t := (v_1 + · · · + v_k)/k.

Figure 1: A mechanism for answering statistical queries using subsampling.
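A minimal, runnable sketch of the Figure 1 mechanism (the function and variable names are ours):

```python
import random

def sq_mechanism(S, queries, k, rng=None):
    """Sketch of the Figure 1 mechanism: answer each statistical query
    phi_t : X -> [0, 1] by averaging k Bernoulli "votes", one per point
    drawn uniformly (with replacement, i.e. i.i.d.) from the sample S."""
    rng = rng or random.Random()
    answers = []
    for phi in queries:
        votes = sum(rng.random() < phi(rng.choice(S)) for _ in range(k))
        answers.append(votes / k)
    return answers

# Each answer concentrates around mu_phi(D); here the sample mean is 0.5.
S = [0] * 500 + [1] * 500
(ans,) = sq_mechanism(S, [lambda x: float(x)], k=4000, rng=random.Random(1))
assert abs(ans - 0.5) < 0.05
```

Each vote is itself a subsampling query with w = 1 and |Y| = 2, which is exactly how the mechanism fits the framework.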

Theorem 3 (Accuracy of our mechanism for answering SQs). For any τ, δ > 0 and adaptively
chosen sequence of statistical queries φ_1, . . . , φ_T : X → [0, 1], with parameters

n := O( √(T ln(T/δ) ln(1/δ)) / τ² )  and  k := Θ( ln(T/δ) / τ² ),   (1)

the mechanism of Figure 1, when given a sample S ∼ D^n and parameter k, with probability at least
1 − δ, answers all queries t ∈ [T] to accuracy

|y_t − µ_{φ_t}(D)| ≤ max( τ · std_{φ_t}(D), τ² )   (2)

where std_φ(D) := √( µ_φ(D)(1 − µ_φ(D)) ) ≤ 1.

The proof of Theorem 3 is a simple corollary of our subsampling framework: Each vote (vi in
Figure 1) is the output of a subsampling query φ : X → {0, 1} and so fits within our framework
with w = 1, and |Y | = 2.
In many settings, the bound of Theorem 3 is state of the art. When std_φ(D) = o(1), its accuracy
improves upon the prior state of the art, both results due to Feldman and Steinke.¹

1. In [FS17], they gave a mechanism with the accuracy guarantees of Theorem 3, but requiring a
larger sample size of

n ≥ Ω( √(T ln(1/δ)) · ln(T/τδ) / τ² ).

¹It’s worth noting that both of these works use a stricter definition std′_φ(D) := √(Var_{x∼D}[φ(x)]). If the range
of φ is {0, 1}, their definition and ours coincide. Otherwise, their definition can be smaller than ours.

In addition to requiring a larger sample size than that of Theorem 3, that mechanism is also
more complicated than ours. It starts by splitting the dataset into chunks each containing 1/τ²
points. When given a query, it first computes the average of that query on each chunk,
and then computes an approximate median of those averages via a differentially private
algorithm. The same mechanism actually solves the approximate median problem described
in Section 1.2.2.

2. In [FS18], they gave a mechanism with a slightly worse sample size² than that of Theorem 3
when the failure probability, δ, is a constant. Their mechanism is also simple: for a sample
S and query φ, they compute φ(S) + ζ where ζ is a Gaussian with mean 0 and variance
that scales with std_φ(D). However, their dependence on δ is a multiplicative 1/δ factor,
exponentially worse than ours.
The non-variance-adaptive setting, where we only wish to bound |y_t − µ_φ(D)| ≤ τ² regardless of
std_φ(D), is more well studied [DFH+15c, BNS+16, SU15b, SU15a, HU14, DK22]. The state of the
art was given recently by Dagan and Kur [DK22], who showed that a sample of size

n = O( √(T ln(1/τδ)) / τ² )

is sufficient in most parameter regimes. Their mechanism works by returning yt = φt (S) + ζ where
ζ is drawn from a very carefully constructed bounded distribution. Our mechanism has a slightly
better dependence on τ and a better accuracy when stdφ (D) is small, but slightly worse dependencies
on T and δ.

Advantages of our mechanism. Our mechanism has advantages beyond the quantitative bound
on the sample size needed for low bias. First, it naturally runs in sublinear time, as answering each
query requires looking at only k ≈ n/√T of the points in S.
Furthermore, our mechanism easily extends to the setting where the analyst does not know
ahead of time how many samples, the parameter k in Figure 1, they want for a particular query.
Rather, they can sequentially sample votes v1 , v2 , . . . while continually updating an estimate for
µφ (D) and stop at any point. The bounds of Theorem 3 hold as long as the total number of votes,
summed over all queries, is not too large. Early stopping can appear naturally in practice. For
example,
1. The analyst may only desire accuracy ±τ , regardless of what stdφ (D) is. If, based on the first
k ′ < k samples, the analyst can determine that stdφ (D) is small, they can stop early as a
small stdφ (D) means fewer samples are needed to achieve the desired accuracy.

2. The analyst may wish to verify whether µφ (D) ≈ c for some value c. If after the first k ′ < k
samples, the average is far from c, the analyst can already determine that µφ (D) is far from
c. This setting has previously been studied in the influential work of [DFH+ 15a, DFH+ 15b]
which showed that there exist mechanisms that answer exponentially many such verification
queries, as long as all but a tiny fraction of the inequalities to verify are true. Our analysis
does not extend to exponentially many queries, but it can easily intertwine standard queries
with verification queries, with the latter being cheaper.
²Their sample size is a √ln(T) multiplicative factor worse than ours even when δ is constant.
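The sequential-voting variant discussed in the items above can be sketched as follows; the Bernstein-style stopping rule here is our own illustrative choice, not a rule from the paper:

```python
import math, random

def sequential_sq(S, phi, k_max, tau, rng=None):
    """Draw votes one at a time (as in Figure 1) and stop early once a crude
    empirical confidence halfwidth drops below tau. The stopping rule is a
    hypothetical Bernstein-style heuristic, not the paper's analysis."""
    rng = rng or random.Random()
    ones = 0
    for i in range(1, k_max + 1):
        ones += rng.random() < phi(rng.choice(S))
        mean = ones / i
        # variance-adaptive halfwidth: small std => stop after few votes
        halfwidth = 2 * math.sqrt(mean * (1 - mean) / i) + 2 / i
        if halfwidth <= tau:
            return mean, i          # stopped after i votes
    return ones / k_max, k_max

# A zero-variance query: the estimate settles immediately, so we stop early.
est, used = sequential_sq([0, 1, 2], lambda x: 0.0, k_max=10_000, tau=0.05,
                          rng=random.Random(2))
assert est == 0.0 and used < 100
```

The bias guarantees then depend on the total number of votes spent across all queries, matching the discussion above.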

1.2.2 A mechanism for finding approximate medians
We also consider a generalization of statistical queries, each of which maps w inputs to some set
R ⊆ ℝ. For such queries, we give a mechanism for determining an approximate median of the
distribution φ(D).

Definition 2 (Approximate median). For a distribution E with support in ℝ, we say a value y is
an approximate median of E if

min( Pr_{x∼E}[x ≤ y], Pr_{x∼E}[x ≥ y] ) ≥ 0.4.³

One particular application for approximate median queries is, once again, answering statistical
queries. Given an SQ φ : X → [0, 1], we can construct

φ′(x_1, . . . , x_w) = E_{i∼[w]}[φ(x_i)] = (1/w) Σ_{i∈[w]} φ(x_i).

Since φ′ and φ have the same mean, and φ′ has a smaller variance, Chebyshev’s inequality implies
that any approximate median of φ′ will be within 2/√w standard deviations of µ_φ(D). As a result,
the mechanism of Figure 2 can give accuracy guarantees similar to those of Theorem 3. The
sample size required is larger (by log factors), but in exchange, it provides better accuracy when
std_φ(D) is very small. In particular, when std_φ(D) < τ, the τ² term of Equation (2) dominates,
but the median-based mechanism does not incur it.

Input: A sample S ∈ X^n split into k disjoint groups S_1, . . . , S_k, each containing ≥ ⌊n/k⌋
elements.

For each time step t ∈ [T],
1. Receive a query φ_t : X^{w_t} → R_t where R_t ⊆ ℝ.
2. Perform binary search on the mechanism’s output y_t ∈ R_t where, to determine whether
y_t ≥ r, the following procedure is used:
(a) For each group i ∈ [k], sample z_i ∼ φ_t(S_i) and set a vote v_i := 1[z_i ≥ r].
(b) Determine y_t ≥ r iff at least half of the votes are 1.
3. After ⌈log₂(|R_t|)⌉ steps of binary search have finished, a single value y_t = r will be
determined. Output it.

Figure 2: The subsampling mechanism for answering approximate median queries.
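A runnable sketch of the Figure 2 mechanism (the splitting into consecutive chunks and all names are our choices):

```python
import random

def median_mechanism(S, queries, k, rng=None):
    """Sketch of Figure 2: split S into k groups; answer each query
    (phi, w, R) -- with R a sorted list of possible outputs -- by binary
    search, deciding each comparison "y_t >= r?" by a majority vote in
    which group i votes 1 iff phi on a random subsample of S_i is >= r."""
    rng = rng or random.Random()
    n = len(S)
    groups = [S[i * (n // k):(i + 1) * (n // k)] for i in range(k)]
    answers = []
    for phi, w, R in queries:
        lo, hi = 0, len(R) - 1
        while lo < hi:                       # invariant: answer lies in R[lo..hi]
            mid = (lo + hi + 1) // 2
            votes = sum(phi(*rng.sample(g, w)) >= R[mid] for g in groups)
            if votes >= k / 2:               # at least half vote "y_t >= R[mid]"
                lo = mid
            else:
                hi = mid - 1
        answers.append(R[lo])
    return answers

# A w = 1 query bucketing Unif({0, ..., 999}) into deciles; an approximate
# median bucket should land near the middle.
S = list(range(1000))
(ans,) = median_mechanism(S, [(lambda x: x // 100, 1, list(range(10)))], k=51,
                          rng=random.Random(3))
assert 3 <= ans <= 5
```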

Theorem 4 (Accuracy of our mechanism for answering approximate median queries). For any
δ > 0, adaptively chosen sequence of queries φ_1 : X^{w_1} → R_1, . . . , φ_T : X^{w_T} → R_T, and parameters
set to

n := O( ln( T ln(R_max)/δ ) · √( w_max · Σ_{t∈[T]} w_t ln|R_t| ) )  and  k := O( ln( T ln(R_max)/δ ) )

³All of our results hold as long as this 0.4 is 0.5 − ε for any fixed ε > 0.

where wmax and Rmax are upper bounds on wt and |Rt | respectively, the mechanism of Figure 2,
when given a sample S ∼ Dn and parameter k, with probability at least 1 − δ, successfully outputs
yt that is an approximate median of φt (D) for all t ∈ [T ].

Feldman and Steinke also give a mechanism for answering approximate median queries [FS17].
Their mechanism needs a sample size of

n ≥ Ω( ln(T R_max/δ) · √( ln(1/δ) · T · w_max² ) ).

Our sample size bound is similar to theirs in the pessimistic setting where w_t ≈ w_max for all t, with
slight improvements in some of the other dependencies. For example, we have a linear dependence
on ln(1/δ), whereas they have a ln(1/δ)^{3/2} dependence.
Most interestingly, our mechanism and that of [FS17] are fairly similar – both rely on splitting
the dataset and, roughly speaking, computing an approximate median of the queries’ values on each
group – but the analyses are wholly different. Their analysis is based on differential privacy. In
contrast, Theorem 4 is a simple corollary of our subsampling framework. Indeed, it is not difficult to
show that our mechanism does not satisfy standard (ε, δ)-differential privacy with strong enough
parameters to give a sample size bound close to that of Theorem 4.

1.3 Other related work


Subsampling has been thoroughly explored in the context of privacy amplification (see e.g. [BBG18,
ZW19] or the book chapter [Ste22]): if A is a differentially private algorithm, running A on a random
subset of the data gives an algorithm with even better privacy parameters. Given the previous
applications of differential privacy to adaptive data analysis, this seems like a natural starting
point for our work. However, such an approach is not sufficient to analyze subsampling queries.
Indeed, subsampling queries do not necessarily satisfy (ε, δ)-differential privacy with sufficiently
good parameters to give useful bounds on the bias.
Fish, Reyzin, and Rubinstein explored the use of subsampling to speed up classical mechanisms
for adaptive data analysis [FRR20]. For example, their mechanism for answering a statistical query
φ computes φ on a random subsample of the data and adds Laplacian noise to that result. This
allows them to retain the accuracy guarantees of prior mechanisms that added Laplacian noise
[BNS+ 16] while also running in sublinear time. In contrast, our work shows that subsampling alone
is sufficient, and achieves sample size bounds that improve upon prior work.

2 Technical overview
2.1 Intuition behind the mutual information bound
We begin with Theorem 2, bounding the mutual information between the sample S ∼ Dn and
query responses y := (y1 , . . . , yT ). Every time a new query response is revealed, yt ∼ φt (S), the
distribution of S conditioned on that response shifts. Our analysis will track a “potential” of that
distribution. The most direct quantity to track would be the entropy, as the mutual information
can be written as

I(S; y) := H(S) − H(S | y) = H(S) − E_{y′∼D_y}[ H(S | y = y′) ]   (3)

where D_y is the marginal distribution of y. Therefore, if we could show that, regardless of the
distribution of S conditioned on the prior responses y_1, . . . , y_{t−1}, the expected drop in entropy
of S upon learning y_t is at most O(w|Y|/n), then the mutual information between S and y would be at most
O(T w|Y|/n). Such a bound holds, but only when S comes from a product distribution.
Fact 2.1 (Mutual information when the sample comes from a product distribution). For any
φ : X^w → Y and S sampled from a product distribution on X^n,

I(S; φ(S)) ≤ O(w|Y|/n).
Recall that, at the start, S is sampled from Dn which is a product distribution. Unfortunately,
as we receive query responses, the distribution of S conditioned on those responses shifts and can
become non-product, and Fact 2.1 is false without the assumption that S is product.
Example 5 (Fact 2.1 is false for non-product distributions). Let X = {0, 1} and S ∈ X^n be either
the all-1s vector or the all-0s vector, each with equal probability. For φ : X¹ → {0, 1} being the identity
function and y ∼ φ(S), we see that⁴ H(S) = ln 2, but after conditioning on either y = 1 or y = 0,
S is fully determined and so has entropy 0. Therefore, I(S; y) = ln 2, a factor of n larger than the
bound in Fact 2.1.
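The arithmetic in Example 5 is easy to verify directly (using the natural-base convention of the footnote):

```python
import math

# Example 5: S is the all-1s or all-0s vector with probability 1/2 each,
# and y is one coordinate of S. Since y fully determines S, conditioning
# on y drops the entropy of S from ln 2 to 0, for every n.
H_S = -(0.5 * math.log(0.5) + 0.5 * math.log(0.5))   # H(S) = ln 2
H_S_given_y = 0.0                                    # S is determined by y
I = H_S - H_S_given_y
assert abs(I - math.log(2)) < 1e-12                  # I(S; y) = ln 2
```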
A key observation about Example 5 is that while y reveals much information about S, it also
makes S “more product”: Before conditioning on y, S has a non-product distribution, but after
conditioning on any choice for y, it becomes a product distribution. In light of this observation, we
could hope that there is a trade-off between how a query response affects the entropy of S and the
“productness” of S. This suggests an amortized analysis. While query responses that drastically
decrease the entropy of S can happen, perhaps they need to be preceded by many queries that
decrease the “productness” of S and are therefore rare.
Indeed, we use such an amortized analysis. To do so, we track the “half-conditional entropy” of
S. Assuming, for ease of exposition, that n is even, we define,
Hhalf(S) := E_J[ H(S_J | S_{−J}) ] = ( H(S) − E_J[ I(S_J ; S_{−J}) ] ) / 2

where J is a uniform size-(n/2) subset of [n], and S_J and S_{−J} are the two disjoint subsets of S
indexed by J and its complement respectively. The term I(S_J; S_{−J}) measures how much knowing
half the elements of S affects the distribution of the other half and is a natural measure
of “productness.” Fortunately, we find that learning the response of a subsampling query cannot
change the half-conditional entropy of the sample by much.
Lemma 2.2 (Bounding expected drop in half-conditional entropy). For any φ : X^w → Y and
random variable S supported on X^n (including non-product distributions),

Hhalf(S) − Hhalf(S | φ(S)) ≤ 2w(|Y| − 1) / n.
Finally, as mutual information is nonnegative, we can lower bound H(S | y) ≥ 2Hhalf(S | y). As
a result, by showing that the half-conditional entropy of S does not decrease much when conditioning on a single
query response, we can conclude that its entropy also does not decrease much after conditioning on
all of the query responses, which bounds the mutual information between S and y.
⁴We compute entropy using the natural base.

2.2 Average leave-many-out KL stability
To prove Lemma 2.2, bounding the drop in conditional-half entropy upon learning a new query
response, we introduce the following notion of algorithmic stability.

Definition 3 (Average leave-many-out KL stability). A randomized algorithm M : X^n → Y is
ALMOKL stable with parameters (m, ε) if there is a randomized algorithm M′ : X^{n−m} → Y such
that, for all samples S ∈ X^n,

E_{J∼([n] choose n−m)}[ dKL( M(S) ∥ M′(S_J) ) ] ≤ ε

where dKL(·∥·) is the KL divergence and S_J indicates the size-(n − m) subset of S specified by the
indices within J.

This definition generalizes Feldman and Steinke’s notion of “average leave-one-out KL stability,”
which corresponds to ALMOKL stability with m = 1 [FS18]. At first glance, our definition seems
harder to satisfy because we must predict the behavior of M(S) when a large fraction of points
from S are removed, rather than when just a single point is removed. However, our analysis also
allows for a larger KL divergence in the case where m is large – specifically, the goal becomes to
minimize the ratio ε/m. Indeed, for subsampling queries, we find using m ≈ n/2 gives a stronger
result and simpler analysis than m = 1 (see Section 3 for more details).
We show that ALMOKL stability can be used to upper bound the drop in m-conditional entropy,
a generalization of half-conditional entropy.

Lemma 2.3 (ALMOKL stability bounds the drop in m-conditional entropy). For any (m, ε)-
ALMOKL stable M : X^n → Y and random variable S supported on X^n (including non-product
distributions),

H_m(S) − H_m(S | M(S)) ≤ ε

where the m-conditional entropy is defined as

H_m(S) := E_{J∼([n] choose m)}[ H(S_J | S_{−J}) ].

Based on Lemma 2.3, if we can bound the ALMOKL stability of each individual subsampling
query, we can bound the drop in m-conditional entropy after a (potentially adaptive) sequence of
subsampling queries. For this to be useful, we connect m-conditional entropy to mutual information.

Lemma 2.4 (m-conditional entropy bounds mutual information). For any random variables S, y
where the distribution of S is product on X^n and m ≤ n,

I(S; y) ≤ (n/m) · ( H_m(S) − H_m(S | y) ).
We note that Lemma 2.4 does require the starting distribution of S to be product, but we
can apply it even if the intermediate distributions of S conditioned on some of the responses are
non-product. This is because Lemma 2.3 does not require productness, so we can easily bound the
drop in m-conditional entropy over the entire sequence of adaptively chosen queries.

Feldman and Steinke also show how to bound the mutual information in the case of m = 1
[FS18]. The techniques used to prove Lemmas 2.3 and 2.4 are similar to theirs, appropriately
generalized for m > 1. Our main contribution to this notion of stability is the idea to generalize
the definition to the m > 1 case, as well as a new presentation, particularly the introduction of
m-conditional entropy.
The last ingredient in the proof of Theorem 2 is a bound on the ALMOKL stability of a single
subsampling query.

Lemma 2.5. For any function φ : X^w → Y and m ≤ n, the subsampling query φ is (m, ε)-ALMOKL
stable for

ε := w(|Y| − 1) / (n − m + 1).
As expected, smaller m gives a better bound on ε. However, smaller m increases the factor
of n/m in Lemma 2.4. We set m = ⌈n/2⌉ to optimize the final mutual information bound which
recovers Theorem 2.

2.3 Concentration of U-statistics


At the heart of the proof of Lemma 2.5 is a generalization of a classic inequality by Hoeffding.

Theorem 6 (Hoeffding’s reduction theorem [Hoe63]). Consider any finite population X ∈ ℝ^n
and integer m ≤ n. Let x_1, . . . , x_m be uniform samples from X chosen without replacement, and
y_1, . . . , y_m be uniform samples from X chosen with replacement. Then, for any continuous and
convex f : ℝ → ℝ,

E[ f( (1/m) Σ_{i∈[m]} x_i ) ] ≤ E[ f( (1/m) Σ_{i∈[m]} y_i ) ].

Theorem 6 gives a simple approach for proving concentration inequalities on sums of random
variables sampled without replacement. Since x ↦ e^{λx} is convex, it implies that the moment
generating function of Σ_i x_i is upper bounded by that of Σ_i y_i. Therefore, Chernoff bounds that
apply to Σ_i y_i also apply to Σ_i x_i.
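This MGF comparison is easy to check numerically; the following Monte Carlo illustration (with toy parameters of our own choosing) instantiates Theorem 6 with f(x) = e^{λx}:

```python
import math, random

rng = random.Random(0)
X = [rng.random() for _ in range(50)]    # a finite population of n = 50 reals
m, lam, trials = 10, 8.0, 20000

def est_mgf(draw):
    # Monte Carlo estimate of E[exp(lam * mean of an m-point draw)]
    return sum(math.exp(lam * sum(draw()) / m) for _ in range(trials)) / trials

without_repl = est_mgf(lambda: rng.sample(X, m))                # the x_i's
with_repl = est_mgf(lambda: [rng.choice(X) for _ in range(m)])  # the y_i's

# Theorem 6 with convex f(x) = exp(lam*x): sampling without replacement
# has the smaller MGF (small multiplicative slack for Monte Carlo noise).
assert without_repl <= with_repl * 1.01
```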
To prove Lemma 2.5, we generalize Hoeffding’s reduction theorem; it corresponds exactly to the
w = 1 case of the following theorem.

Theorem 7 (Extension of Hoeffding’s reduction theorem to U-statistics). Consider any finite
population X ∈ ℝ^n and φ : X^w → ℝ. For any convex f : ℝ → ℝ and integer m ≤ n,

E_{J∼([n] choose m)}[ f( µ_φ(X_J) ) ] ≤ E_{T_1, . . . , T_k ∼iid X^{(w)}}[ f( (1/k) Σ_{i∈[k]} φ(T_i) ) ]

where k = ⌊m/w⌋ and T ∼ X^{(w)} indicates that T contains w elements from X sampled uniformly
without replacement.

In the statistics literature, the random variable µ_φ(X_J) is called the U-statistic of order m
with kernel φ(·). This quantity was introduced in [Hoe48] and has since received extensive
study (see, for example, the textbook [KB13]). Despite that extensive study, to the best of our

knowledge, no inequality of the form given in Theorem 7 is known. The closest we are aware of is
the recent work [AKW22], which showed that µ_φ(X_J) satisfies appropriate variants of Hoeffding’s
and Bernstein’s concentration inequalities. Such concentration inequalities are easy corollaries
of Theorem 7, which can be used to give a suitable bound on the moment generating function of
µ_φ(X_J). Indeed, our proof of Theorem 7 borrows ideas from, and simplifies, [AKW22]’s proof.
We use Theorem 7 to prove Lemma 2.5. Given the ubiquity of U-statistics, it may also be of
independent interest.

2.4 Bounding the bias


Mutual information is known to bound bias in a variety of contexts (see e.g. [RZ16, FS18]). Similarly,
we use the mutual information bound of Theorem 2 to bound the bias of responses to subsampling
queries. The purpose of this subsection is to formalize and explain our bias bounds; we
defer the (mostly standard) proofs to Section 7. We begin by formalizing Theorem 1.
Theorem 8 (Formal version of Theorem 1). For any distribution D over domain X and analyst
making a series of adaptive queries φ_1, . . . , φ_T : X^w → Y to a sample S ∼ D^n,

E[ sup_{t∈[T], y∈Y} ( φ_t(S)(y) − φ_t(D)(y) )² ] ≤ O( w²T|Y| / n² + w log(T|Y|) / n )   (4)

where the expectation is over both the sample S and the analyst’s choice of queries, and the notation
D(x) denotes Pr_{x′∼D}[x′ = x].
The sample size in Theorem 1 is set so that the right-hand side of Equation (4) is an arbitrarily
small constant. More generally, by Markov’s inequality, with probability at least 0.9, for all t ∈ [T]
and y ∈ Y,

|φ_t(S)(y) − φ_t(D)(y)| ≤ O( w√(T|Y|) / n + √( w log(T|Y|) / n ) ).

The second term, Õ(√(w/n)), quantifies the inherent bias in a sample: there are queries⁵ φ : X^w →
{0, 1} for which, even with a fresh sample S, E_{S∼D^n}[ |φ(S)(1) − φ(D)(1)| ] = Θ(√(w/n)).
The first term therefore quantifies the extra bias. When T|Y| ≤ n/w, that first term is dominated
by the inherent bias. The guarantee then slowly degrades from the inherent bias to vacuous as T|Y|
varies between n/w and n²/w². In contrast, a naive approach which uses disjoint samples for each query
works for T ≤ n/w but cannot move beyond that regime.
Our next theorem generalizes Theorem 8 in a number of ways. First, we disconnect the test
queries from the first T queries the analyst asks. After receiving the responses y_1, . . . , y_T, the
analyst chooses any set of test queries ψ_1 : X^{v_1} → [0, 1], . . . , ψ_m : X^{v_m} → [0, 1] for which we
bound |E_{y∼ψ(S)}[y] − E_{y∼ψ(D)}[y]|. To recover Theorem 8, we define a test query ψ(x_1, . . . , x_w) :=
1[φ_t(x_1, . . . , x_w) = y] for each t ∈ [T] and y ∈ Y_t. The following notation will be convenient.
Definition 4 (Expectation and variance of a query). For a query φ : X^w → Y ⊆ ℝ and sample
S ∈ X^n, we use the notation µ_φ(S) as shorthand for E_{y∼φ(S)}[y] and Var_φ(S) as shorthand for
Var_{y∼φ(S)}[y]. Similarly, for a distribution D over domain X, we define µ_φ(D) := E_{y∼φ(D)}[y] and
Var_φ(D) := Var_{y∼φ(D)}[y].

⁵One such example, for D = Unif({−1, 1}), is the query φ(x_1, . . . , x_w) = 1[x_1 + · · · + x_w ≥ 0].

Intuitively, when ψ(D) concentrates tightly, it is easier to give a good estimate of µψ (D). The
second improvement of Theorem 9 is that it gives a variance-dependent error guarantee that improves
when Varψ (D) is small. In contrast, Theorem 8 uses the pessimistic bound that Varψ (D) ≤ 1.

Definition 5 (Error of a query). For any ψ : X^w → [0, 1], distribution D over X and sample
S ∈ X^n, we define

    error(ψ, S, D) := (1/w) · min( ∆, ∆²/Varψ(D) )    where ∆ := |µψ(S) − µψ(D)|.

If error(ψ, S, D) ≤ ε, then ∆ ≤ max(wε, √(wε · Varψ(D))). When Varψ(D) = 1, the second term
or the trivial bound of ∆ ≤ 1 dominates, but a lower Varψ(D) can lead to a substantially better
bound on ∆.
Lastly, we allow the analyst to choose a different domain and range size for each query:
as a function of the responses y1, . . . , yt−1, the analyst chooses wt, Yt, and a subsampling query
φt : X^{wt} → Yt. Our bound will depend on the expected total "cost" of queries the analyst asks,
where the "cost" of a query φ : X^w → Y is w|Y|.

Theorem 9 (Generalization of Theorem 8). For any distribution D over domain X and analyst
that makes a series of adaptive queries φ1 : X^{w1} → Y1, . . . , φT : X^{wT} → YT to a sample S ∼ D^n
and then chooses a collection of tests Ψ with |Ψ| ≤ m,

    E[ sup_{ψ∈Ψ} error(ψ, S, D) ] ≤ O( E[Σ_{t∈[T]} wt|Yt|]/n² + (log m + 1)/n )

where the expectation is both over the sample S ∼ D^n and the analyst's decisions.

2.5 A self-reduction for boosting success probability


As is typical of mutual information-based bounds, Theorem 9 only guarantees low bias in expectation.
When wt = 1 for all t ∈ [T ], we further show a guarantee that holds with an exponentially small
failure probability.

Theorem 10 (Improved dependence on failure probability). In the setting of Theorem 9, if the
analyst's choices of queries satisfy wt = 1 for all t ∈ [T] and Σ_{t∈[T]} |Yt| ≤ b almost surely, then, for
any failure probability δ > 0,

    Pr[ sup_{ψ∈Ψ} error(ψ, S, D) ≥ O( ln(m/δ) · (b/n² + 1/n) ) ] ≤ δ.

The case of w = 1 is particularly important for two reasons.

1. It is sufficient for our application of answering statistical queries, a widely-applicable query
   class, given in Section 1.2.1. Indeed, one way to characterize statistical queries is as precisely
   those queries φ : X^n → [0, 1] for which an unbiased and bounded estimator of µφ(S) can be
   computed given a single x ∼ Unif(S). Our mechanism for answering statistical queries simply
   averages many of these unbiased estimators.

2. One way to answer a query with w ≥ 2 is to cast a sample of n points from D as a sample of
   ⌊n/w⌋ points each drawn from D^w. By doing so, each query φ : X^w → Y can be answered by
   looking at one "grouped point," and so Theorem 10 gives a high probability guarantee. We
conjecture that such grouping is unnecessary and that a variant of Theorem 10 would directly
hold without the restriction that wt = 1 for all t ∈ [T ]. That said, the proof breaks in this
setting, and so we consider extending it to be an intriguing open problem.
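The grouping trick in item 2 above can be sketched as follows (a minimal illustration of ours, not the paper's implementation; the function names and the example query are arbitrary):

```python
import random

def group_sample(sample, w):
    """Cast a sample of n points from D as floor(n/w) points from D^w."""
    n = len(sample)
    k = n // w
    return [tuple(sample[i * w:(i + 1) * w]) for i in range(k)]

def answer_grouped_query(grouped, phi):
    """Answer phi : X^w -> Y by subsampling a single 'grouped point' (so w = 1)."""
    point = random.choice(grouped)  # one draw from the grouped sample
    return phi(*point)

random.seed(0)
sample = [random.randint(0, 1) for _ in range(100)]  # S ~ D^100, D = Unif({0,1})
grouped = group_sample(sample, w=3)                  # 33 points, each from D^3
assert len(grouped) == 33
y = answer_grouped_query(grouped, lambda a, b, c: int(a + b + c >= 2))
assert y in {0, 1}
```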

The high probability guarantee of Theorem 10 does not follow from only a mutual information
bound. Instead, we use mutual information to achieve a constant failure probability and separately
prove a reduction from the low failure probability case to that of constant failure probability.

Lemma 2.6 (Informal version of Lemma 8.1). In the context of Theorem 9, suppose that the analyst
always asks subsampling queries φt : X^{wt} → Yt where wt = 1 for all t ∈ [T]. Any low-bias guarantee
that holds with constant probability for a sample size of n holds with probability 1 − 2^{−Ω(N/n)} given a
sample of size N > n.

The starting point of Lemma 2.6 is the following observation. Given a sample S ∈ X^N where N := nk,
the answer of a query y = φ(S) where φ : X^1 → Y is distributed identically to the answer y′ = φ(S_i), where
S1, . . . , Sk ∈ X^n are equal-sized partitions of S and i ∼ Unif([k]), as depicted in Figure 3.

Figure 3: An illustration of why, for φ : X^1 → Y, the distribution of φ(S) is equivalent to φ(Si)
where S1, . . . , Sk are partitions of S and i ∼ Unif([k]).
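This observation is easy to verify exactly (a small check of ours; φ here is the identity query, so the distribution of φ(S) is just that of x ∼ Unif(S)):

```python
from collections import Counter
from fractions import Fraction

def direct_dist(S):
    """Distribution of phi(S) for phi the identity, i.e. of x ~ Unif(S)."""
    n = len(S)
    c = Counter(S)
    return {x: Fraction(c[x], n) for x in c}

def partitioned_dist(S, k):
    """First pick a part i ~ Unif([k]) of an equal-sized partition, then x ~ Unif(S_i)."""
    n = len(S)
    assert n % k == 0
    parts = [S[j * (n // k):(j + 1) * (n // k)] for j in range(k)]
    dist = Counter()
    for part in parts:
        for x in part:
            dist[x] += Fraction(1, k) * Fraction(1, n // k)
    return dict(dist)

S = ['a', 'b', 'a', 'c', 'b', 'a']
assert direct_dist(S) == partitioned_dist(S, k=3)  # the two sampling schemes agree
```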

The hypothesis of Lemma 2.6 says that, if we look at a single i ∈ [k] in isolation, the probability
that Si is biased is at most a constant, say 0.1. Therefore, if these events were independent, the
probability a large fraction, say 0.2k, of the disjoint subsamples are biased is 2−Ω(k) .
However, because the analyst can choose which query to ask a group Si as a function of
responses from some other group Si′ , the events indicating whether each group is biased need not be
independent. To handle this we generalize a direct product theorem of Shaltiel that was originally
applied to fair decision trees [Sha04]. That generalization, given in Lemma 8.4, shows that while
those events need not be independent, they behave no worse than if they were.
Using the above direct product theorem, we determine that the probability a constant fraction of
S1, . . . , Sk are biased is 2^{−Ω(k)}. Finally, we connect this to the bias of S: Since each Si is a subsample of S,

if S were biased, it is likely that many of S1 , . . . , Sk are also biased. This allows us to conclude
that the probability S is biased is not much larger than the probability many of S1 , . . . , Sk are also
biased.

3 Improvements upon the conference version


In the conference version of the paper [Bla23], Theorem 2 was weaker by a multiplicative log n
factor, which similarly affects Theorems 1, 8 and 9. As a result, the statistical query mechanism
described in Figure 1 was not known to be state of the art. To get around this, the conference
version showed that a small amount of noise could be added to each query response to remove
this log n factor: Specifically, we replace each subsampling query φ : X^w → Y with φ′ : X^w → Y
where φ′(x1, . . . , xw) outputs Unif(Y) with a small, carefully chosen probability, and otherwise
φ(x1, . . . , xw). It was therefore able to recover the quantitative parameters of Theorem 3 but with
a more complicated mechanism.
To remove this multiplicative log n factor, we revamped many of the proofs, including all of the
techniques described in Sections 2.1 to 2.3. Most notably, this involved developing a new notion of
stability. The conference version used Feldman and Steinke’s average leave-one-out KL stability
[FS18], which this version generalizes to average leave-many-out KL stability (Definition 3). In
Appendix A, we show that a bound using Feldman and Steinke’s framework directly necessarily has
this additional log n factor. Interestingly, in addition to giving a better quantitative result, this new
version of stability leads to simpler and substantially shorter proofs.
Beyond strengthening our result, we hope the new ingredients developed in this version will have
independent interest. This stability notion, and our new generalization of Hoeffding's reduction
theorem (Theorem 7), may have other applications. Furthermore, much of the presentation, particularly
Section 2, has been revamped to provide more intuition.

4 Notation and key definitions


Indexing our sample. For a natural number n, we use [n] to denote the set {1, . . . , n}. For a tuple
S ∈ X^n, we use the notation Si to indicate the ith element of S, and S−i := (S1, . . . , Si−1, Si+1, . . . , Sn)
denotes the remaining n − 1 elements. For w ≤ n, we use the notation (S choose w) to indicate the set of all
(n choose w) unordered size-w multisets S′ that are contained within S. For any set of indices J ∈ ([n] choose w), we
use SJ := (S_{J1}, . . . , S_{Jw}) to denote the elements of S indexed by J, and S−J := S_{[n]\J} for the elements
indexed by its complement. We further use S^{(w)} to indicate the multiset of all n(n − 1) · · · (n − w + 1)
ordered size-w tuples contained within S, i.e. all (x1, . . . , xw) such that x1 ∈ S, x2 is in S with x1
removed, and so on.
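For a concrete check of the S^{(w)} notation (our code; `itertools.permutations` enumerates exactly the ordered tuples over distinct positions, which matches sampling without replacement from a multiset):

```python
from itertools import permutations
from math import perm

S = ('a', 'b', 'b', 'c')  # a multiset sample, n = 4
n, w = len(S), 2

# S^(w): all ordered size-w tuples drawn from S without replacement,
# i.e. one tuple per ordered choice of w distinct *positions* in S.
S_w = list(permutations(S, w))
assert len(S_w) == perm(n, w) == n * (n - 1)  # 12 ordered tuples
assert S_w.count(('b', 'b')) == 2  # repeated values in S give repeated tuples
```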

Random variables and distributions. We use boldfont to denote random variables and
calligraphic font to denote distributions (e.g. x ∼ D). The notation x ∼ S for a (multi)set or
tuple S is shorthand for a uniform sample from S, x ∼ Unif(S). For a distribution D, we will use
x1, . . . , xw ∼ D i.i.d., or equivalently x ∼ D^w, to denote that x1, . . . , xw are independent and distributed
according to D. For example, x1, . . . , xw ∼ S i.i.d. are w samples chosen uniformly with replacement
from S, whereas y1, . . . , yw ∼ S^{(w)} are w samples chosen uniformly without replacement. For a
distribution D over domain X and element x ∈ X, we use D(x) to denote the probability mass
function of D evaluated at x. For notational convenience, all domains and distributions will be
discrete, though our results could be extended to the setting where distributions are continuous.6

KL divergence, entropy, mutual information and their properties

Definition 6 (Kullback-Leibler (KL) Divergence). For distributions D, E supported on the same
domain, the KL divergence between D and E is defined as

    dKL(D∥E) := E_{x∼D}[ log( D(x)/E(x) ) ].

Definition 7 (Entropy and conditional entropy). The entropy of a random variable x ∼ Dx is
defined as

    H(x) := E_{x∼Dx}[ ln( 1/Dx(x) ) ].

For a second random variable y with marginal distribution Dy, the conditional entropy is defined as

    H(x | y) := E_{y′∼Dy}[ H(x | y = y′) ].

One of the many convenient properties of entropy is that it is additive.

Fact 4.1 (Additivity of entropy). For any random variables x, y,

H((x, y)) = H(x) + H(y | x) = H(y) + H(x | y).

Definition 8 (Mutual information). For random variables x, y jointly distributed according to a
distribution D, let D(x) and D(y) be the marginal distributions of x and y respectively. The mutual
information between x and y is defined as

    I(x; y) := H(x) − H(x | y) = H(y) − H(y | x) = I(y; x),

which is equivalent to the following:

    I(x; y) = dKL( D ∥ D(x) × D(y) ) = E_y[ dKL( D(x | y = y) ∥ D(x) ) ].
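The identities in Definitions 6 to 8 can be checked numerically on a small joint distribution (our code; the particular distribution is arbitrary):

```python
from math import log

# A small joint distribution D over (x, y) in {0,1} x {0,1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

def H(dist):
    return sum(p * log(1 / p) for p in dist.values() if p > 0)

def dKL(d, e):
    return sum(p * log(p / e[k]) for k, p in d.items() if p > 0)

# H(x | y) = E_y[H(x | y = y')]
H_x_given_y = sum(
    py[y] * H({x: joint[(x, y)] / py[y] for x in (0, 1)}) for y in (0, 1)
)
product = {(x, y): px[x] * py[y] for x in (0, 1) for y in (0, 1)}

i1 = H(px) - H_x_given_y  # I(x;y) = H(x) - H(x|y)
i2 = dKL(joint, product)  # I(x;y) = dKL(D || D(x) x D(y))
assert abs(i1 - i2) < 1e-9 and i1 >= 0  # the two formulas agree; I is nonnegative
```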

A useful fact about KL divergence and mutual information is that they are nonnegative.

Fact 4.2 (Non-negativity of KL divergence and mutual information). For any distributions D and
E on the same domain,
dKL (D∥E) ≥ 0.
As a consequence, for any random variables x and y,

I(x; y) ≥ 0.

As an easy corollary, we obtain the following.


6
In more detail, by keeping all distributions discrete, we can directly reason about drops in entropy. This eases
notation and, we hope, gives more intuition to the reader. In the continuous setting, each of these drops in entropy
can be translated to mutual information.

Corollary 4.3 (Conditioning cannot increase entropy). For any random variables x, y,

H(x | y) ≤ H(x).

We’ll also use conditional mutual information.

Definition 9 (Conditional mutual information). For random variables x, y, z jointly distributed,
the mutual information of x and y conditioned on z is

    I(x; y | z) := E_{z′∼Dz}[ I( (x | z = z′); (y | z = z′) ) ]

where Dz is the marginal distribution of z.

4.1 Formal model of the analyst

Input: A sample S ∈ X^n not known to the analyst.

For each time step t ∈ [T], the analyst:
1. Selects a query φt : X^{wt} → Yt which can depend on the previous responses y1, . . . , yt−1.
2. Receives the response yt ∼ φt(S).

Figure 4: An analyst asking an adaptive sequence of subsampling queries.
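The interaction in Figure 4 can be sketched as a short simulation (a minimal illustration of ours; the threshold-adapting analyst is an arbitrary example, and `random.sample` implements the without-replacement subsampling):

```python
import random

def respond(sample, phi, w, rng):
    """One round: draw y ~ phi(S), i.e. apply phi to w points sampled
    from S uniformly without replacement."""
    return phi(*rng.sample(sample, w))

rng = random.Random(1)
S = [rng.gauss(0, 1) for _ in range(1000)]  # S ~ D^n, unknown to the analyst

responses = []
threshold = 0.0
for t in range(10):
    # The analyst adapts: the next query depends on responses y_1, ..., y_{t-1}.
    phi = lambda a, b, thr=threshold: int(a + b >= thr)  # phi_t : X^2 -> {0, 1}
    y = respond(S, phi, w=2, rng=rng)
    responses.append(y)
    threshold += 0.1 if y else -0.1

assert len(responses) == 10 and set(responses) <= {0, 1}
```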

Figure 4 summarizes the interaction of the analyst with a sample S ∈ X^n. Formally, we model
the analyst A as a function mapping a time step t ∈ [T], previous query responses y1, . . . , yt−1, and
a source of randomness z ∼ Z to a subsampling query φt. After the T th step, the analyst outputs
a series of test queries ψ1 : X^{v1} → {0, 1}, . . . , ψm : X^{vm} → {0, 1}, also as a function of y1, . . . , yT and
z. With this formal description of the analyst, we can give the following lengthy but fully explicit
description of Theorem 9.

Theorem 11 (Theorem 9 restated with a formal model of the analyst). For any analyst A and
distributions D, Z, draw S ∼ D^n and z ∼ Z. For each t ∈ [T], let φt = A(t, y1, . . . , yt−1, z) and
yt ∼ φt^{(n)}(S). Then, for ψ1, . . . , ψm = A(T, y1, . . . , yT, z),

    E[ sup_{i∈[m]} error(ψi, S, D) ] ≤ O( E[Σ_{t∈[T]} cost(φt)]/n² + (log m + 1)/n )    (5)

where cost(φ : X^w → Y) := w · |Y|.

4.1.1 Deterministic vs randomized analysts


We will restrict our attention to deterministic analysts, where the analyst's output is a deterministic
function of the previous responses. Deterministic analysts do not take in a source of randomness
(previously denoted z ∼ Z). Through the following simple argument, this is without loss of generality.

Lemma 4.4 (Deterministic analysts are as powerful as randomized analysts). If Theorems 1, 2 and 8
to 10 hold for deterministic analysts, then they also hold for randomized analysts.
Proof. Let A be a randomized analyst. We can think of it as a mixture of deterministic analysts: A
first draws z ∼ Z and then executes the deterministic strategy Az := A(·, z). We begin with the
reduction for Theorems 1, 8 and 9,

    E[Error with A as the analyst] = E_z[ E[Error with Az as the analyst] ]
                                   ≤ sup_z E[Error with Az as the analyst].

The left-hand side of Equation (5) when A is the analyst is the expectation of the same quantity
when Az is the analyst. Therefore, if it is small for all deterministic analysts, it is also small for all
randomized analysts.
    For Theorem 10, the same argument holds where "Error with A as the analyst" is replaced
by "Failure probability with A as the analyst." Finally, for Theorem 2, which bounds the mutual
information of S and y, let yz be the distribution of y conditioned on the analyst being Az. Then,

    E_z[I(S; yz)] − I(S; y) = ( H(S) − E_z[H(S | yz)] ) − ( H(S) − H(S | y) )
                            = H(S | y) − H(S | y, z)
                            = I((S | y); z) ≥ 0.    (Fact 4.2)

Therefore, upper bounding the mutual information for deterministic analysts in Theorem 2 suffices
to upper bound it for randomized analysts as well.

4.1.2 Deterministic vs randomized subsampling queries


The analyst may be interested in asking a subsampling query φ : X^w → Y that itself incorporates
randomness. In this case, we can model φ as a deterministic function φ′ which also takes in a source
of randomness, i.e. φ(x1, . . . , xw) := φ′(x1, . . . , xw, z) where z is the source of randomness. However,
in each of our theorems, we can without loss of generality assume that the query functions are
deterministic by using the random sample as a source of randomness. If the original distribution is D, we
consider the augmented distribution D′ := D × Z1 × Z2 × · · · × ZT where each Zt will be a source of
randomness for the tth query. We then create a deterministic query ψt : (X × Z1 × · · · × ZT)^w → Y
that evaluates

    ψt(a1, . . . , aw) := φ′((a1)x, . . . , (aw)x, (a1)zt)

where for a ∈ X × Z1 × · · · × ZT, the notation ax takes the first component of a and azt takes the
(t + 1)th component of a. Using ψt with the distribution D′ will generate the exact same distribution
of responses as φt with the distribution D, despite the former being a deterministic function and
the latter a randomized function.

5   Average leave-many-out KL stability and its connection to mutual information
The starting point of our analysis is a new notion of algorithmic stability. We require the algorithm’s
output to be stable even if a large portion of the dataset is removed. This definition generalizes

[FS18]'s notion of "average leave-one-out KL stability," which is equivalent to the below with m = 1.

Definition 10 (Average leave-many-out KL stability, Definition 3 restated). A randomized algorithm
M : X^n → Y is ALMOKL stable with parameters (m, ε) if there is a randomized algorithm
M′ : X^{n−m} → Y such that, for all samples S ∈ X^n,

    E_{J∼([n] choose n−m)}[ dKL( M(S) ∥ M′(SJ) ) ] ≤ ε.

We'll furthermore use the notation stab_m(M) to refer to the infimum over ε such that M is
(m, ε)-ALMOKL stable.

The utility of ALMOKL stability is that we can use it to upper bound the mutual information
between a sample and a series of adaptively chosen randomized algorithms.

Theorem 12 (Using ALMOKL stability to upper bound mutual information). Let S be drawn from
a product distribution over X^n and, for each t ∈ [T], draw yt ∼ M^{y1,...,yt−1}(S). Then, for any
m ∈ [n],

    I(S; (y1, . . . , yT)) ≤ (n/m) · E[ Σ_{t∈[T]} stab_m(M^{y1,...,yt−1}) ].

The proof of Theorem 12 is broken into two pieces. First we bound the drop in m-conditional
entropy in terms of ALMOKL stability.

Definition 11 (m-conditional entropy). For any random variable S supported on X^n and m ∈ [n],
the m-conditional entropy is defined as

    Hm(S) := E_{J∼([n] choose m)}[ H(SJ | S−J) ].

Furthermore, for any y with marginal distribution Dy, we use

    Hm(S | y) := E_{y′∼Dy}[ Hm((S | y = y′)) ].

Lemma 5.1 (ALMOKL stability bounds the drop in m-conditional entropy, Lemma 2.3 restated).
For any (m, ε)-ALMOKL stable M : X^n → Y and random variable S supported on X^n (including
non-product distributions),

    Hm(S) − Hm(S | M(S)) ≤ ε.

The second ingredient of Theorem 12 is using the drop in m-conditional entropy to upper bound
mutual information.

Lemma 5.2 (m-conditional entropy bounds mutual information, Lemma 2.4 restated). For any
random variables S, y where the distribution of S is product on X^n and m ≤ n,

    I(S; y) ≤ (n/m) · ( Hm(S) − Hm(S | y) ).
Before proving each of these two lemmas individually, we’ll show how they imply Theorem 12.

Proof of Theorem 12 given Lemmas 5.1 and 5.2. We first use Lemma 5.1 to bound the drop in m-
conditional entropy of S from conditioning on y := (y1, . . . , yT):

    E_y[Hm(S) − Hm(S | y)] = E_y[ Σ_{t∈[T]} Hm(S | y1, . . . , yt−1) − Hm(S | y1, . . . , yt) ]
                           ≤ E_y[ Σ_{t∈[T]} stab_m(M^{y1,...,yt−1}) ].

The desired result follows from Lemma 5.2.

5.1 Bounding the drop in m-conditional entropy using ALMOKL stability


In this subsection, we prove Lemma 5.1. This proof uses many of the ideas from Feldman and
Steinke's proof of a similar result in the special case where m = 1 [FS18]. Just like them, we use
the following fact, which states that the "center" of a collection of probability distributions, as measured
by minimizing the average KL divergence to each distribution, is exactly the mean of those distributions.

Fact 5.3 ([BMD+05] Proposition 1 & [FSG08] Theorem II.1). Let {Da} be a family of distributions,
each supported on X and indexed by a ∈ A, and let A be a distribution supported on A. Let DA :=
E_{a∼A}[Da] denote the convex combination of the distributions {Da} with mixture weights given by
A. Then, for any D′ supported on X,

    E_{a∼A}[ dKL(Da ∥ DA) ] ≤ E_{a∼A}[ dKL(Da ∥ D′) ].

Fact 5.3 actually holds for any Bregman divergence, which, in particular, includes KL divergence.

Proof of Lemma 5.1. Expanding the definitions, our goal is to show that for any S distributed on
X^n, m ∈ [n], and randomized algorithms M : X^n → Y, M′ : X^{n−m} → Y,

    E_{J∼([n] choose m)}[ H(SJ | S−J) − H(SJ | S−J, M(S)) ] ≤ sup_{S∈X^n} E_{J∼([n] choose m)}[ dKL( M(S) ∥ M′(S−J) ) ].

Rather than picking a worst-case S, we'll prove the above holds on average over the distribution of
S:

    E_{J∼([n] choose m)}[ H(SJ | S−J) − H(SJ | S−J, M(S)) ] ≤ E_S[ E_{J∼([n] choose m)}[ dKL( M(S) ∥ M′(S−J) ) ] ].

We'll show that the above inequality holds for each J ∈ ([n] choose m) individually, which implies the desired
result by averaging over the choices of J. Throughout this proof, we'll use Pr[·] as an operator that
takes as input a random variable and outputs the probability a new draw of that random variable
takes on that value (for example, H(x) = E_x[ln(1/ Pr[x])]). Let y ∼ M(S). Then,

    H(SJ | S−J) − H(SJ | S−J, M(S)) = E_{S,y}[ ln( 1/Pr[SJ | S−J] ) − ln( 1/Pr[SJ | S−J, y] ) ]
                                     = E_{S,y}[ ln( Pr[SJ | S−J, y] / Pr[SJ | S−J] ) ].

Note that, for any random variables a, b, c,

    Pr[a | b, c]/Pr[a | c] = ( Pr[a | b, c] · Pr[b | c] ) / ( Pr[a | c] · Pr[b | c] ) = Pr[a, b | c] / ( Pr[a | c] · Pr[b | c] ) = Pr[b | a, c]/Pr[b | c].

Therefore,

    H(SJ | S−J) − H(SJ | S−J, M(S)) = E_{S,y}[ ln( Pr[y | S] / Pr[y | S−J] ) ]
                                     = E_S[ dKL( (y | S) ∥ (y | S−J) ) ]
                                     = E_{S−J}[ E_{SJ|S−J}[ dKL( (y | S−J, SJ) ∥ (y | S−J) ) ] ].

Here, we utilize Fact 5.3. Fix any choice of S−J = S−J. The distribution of (y | S−J = S−J)
is exactly the convex combination of y | SJ, S−J where each SJ is sampled conditioned on S−J.
Therefore, it is the distribution minimizing the above KL divergence. As a result, we can replace it
with any other distribution, namely M′(S−J), without decreasing the KL divergence:

    H(SJ | S−J) − H(SJ | S−J, M(S)) ≤ E_{S−J}[ E_{SJ|S−J}[ dKL( (y | S−J, SJ) ∥ M′(S−J) ) ] ]
                                     = E_S[ dKL( M(S) ∥ M′(S−J) ) ].

5.2 Bounding mutual information using m-conditional entropy


In this subsection, we prove Lemma 5.2, which bounds the mutual information between S and y in
terms of the average drop in the m-conditional entropy of S conditioned on y. We begin with the special
case where n is even and m = n/2, as it is simpler and furthermore sufficient to recover all of our results
with the mild assumption that the sample size is even.

Proposition 5.4. For any random variable S supported on X^{2m},

    H(S) = 2Hm(S) + E_{J∼([2m] choose m)}[ I(SJ; S−J) ].

Proof. For any choice of J ∈ ([2m] choose m),

    H(S) = H(S−J) + H(SJ | S−J)    (Fact 4.1)
         = ( H(S−J | SJ) + I(S−J; SJ) ) + H(SJ | S−J).    (Definition 8)

The desired result follows by averaging over all choices of J and noting that the distribution of J
and its complement are identical.

Proof of Lemma 5.2 in the special case where m = n/2. We bound,

    I(S; y) = H(S) − H(S | y)    (Definition 8)
            = ( 2Hm(S) + E_{J∼([2m] choose m)}[I(SJ; S−J)] ) − ( 2Hm(S | y) + E_{J∼([2m] choose m)}[I(SJ; S−J | y)] )    (Proposition 5.4)
            = 2Hm(S) − 2Hm(S | y) − E_{J∼([2m] choose m)}[I(SJ; S−J | y)]    (S is product)
            ≤ 2Hm(S) − 2Hm(S | y).    (Nonnegativity of mutual information, Fact 4.2)

This is exactly the special case of Lemma 5.2 where m = n/2.

To expand beyond this special case, we will appropriately generalize Proposition 5.4. We separate
it into two cases: when S is product and when it is not.

Lemma 5.5. If S is drawn from a product distribution over X^n,

    H(S) = (n/m) · Hm(S).
Proof. We can expand H(S) using the additivity of entropy, where S<i := (S1, . . . , Si−1),

    H(S) = Σ_{i∈[n]} H(Si | S<i)
         = Σ_{i∈[n]} H(Si).    (S is product)

Similarly, we expand Hm(S),

    Hm(S) = E_{J∼([n] choose m)}[ H(SJ | S−J) ]
          = E_{J∼([n] choose m)}[ Σ_{j∈J} H(Sj) ].    (S is product)

The desired result follows from each index appearing in J with probability m/n.
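Lemma 5.5 can be checked exactly on a small example. The sketch below (our code, not the paper's) computes Hm(S) from Definition 11 via the chain rule H(SJ | S−J) = H(S) − H(S−J), without assuming independence, and confirms H(S) = (n/m)·Hm(S) for a product distribution on {0,1}^4:

```python
from itertools import combinations, product
from math import log, isclose

biases = [0.2, 0.5, 0.7, 0.9]  # S is a product distribution on {0,1}^4
n = len(biases)
joint = {}
for s in product((0, 1), repeat=n):
    p = 1.0
    for b, si in zip(biases, s):
        p *= b if si else 1 - b
    joint[s] = p

def H(dist):
    return sum(p * log(1 / p) for p in dist.values() if p > 0)

def H_m(joint, m):
    """H_m(S) = E_{J ~ ([n] choose m)}[H(S_J | S_-J)], from the full joint."""
    Js = list(combinations(range(n), m))
    total = 0.0
    for J in Js:
        rest = [i for i in range(n) if i not in J]
        marg = {}  # marginal distribution of S_-J
        for s, p in joint.items():
            key = tuple(s[i] for i in rest)
            marg[key] = marg.get(key, 0.0) + p
        total += H(joint) - H(marg)  # chain rule: H(S_J | S_-J) = H(S) - H(S_-J)
    return total / len(Js)

for m in (1, 2):
    assert isclose(H(joint), (n / m) * H_m(joint, m))  # Lemma 5.5
```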

Lemma 5.6. For any S distributed on X^n (including non-product distributions),

    H(S) ≥ (n/m) · Hm(S).

Taken together, Lemmas 5.5 and 5.6 show that H(S) − (n/m) · Hm(S) is a measure of the distance
from S to a product distribution.
    The proof of Lemma 5.6 uses the following.

Proposition 5.7. For any S distributed on X^n and k ≤ n − m,

    E_{K∼([n] choose k), j∼Unif([n]\K)}[ H(Sj | SK) ] ≥ Hm(S)/m.    (6)

Proof. First, since conditioning cannot increase entropy (Corollary 4.3), the left-hand side of Equa-
tion (6) is non-increasing in k. Therefore it suffices to consider the case where k = n − m. By the
additivity of entropy, for any J := (j1, . . . , jm) ∈ ([n] choose m),

    H(SJ | S−J) = Σ_{i∈[m]} H(Sji | S−J, Sj1, . . . , Sji−1)
                ≤ Σ_{i∈[m]} H(Sji | S−J).    (Corollary 4.3)

Dividing both sides by m and averaging over all choices of J = J exactly recovers the desired
result.
Proof of Lemma 5.6. Fix a single choice J ∈ ([n] choose m) and let k1, . . . , kn−m be the elements in its
complement. Then, by the additivity of entropy,

    H(S) = Σ_{i∈[n−m]} H(Ski | Sk1, . . . , Ski−1) + H(SJ | S−J).

Averaging over all choices of J and all permutations of k1, . . . , kn−m,

    H(S) = Σ_{i∈[n−m]} E_{K∼([n] choose i−1), j∼([n]\K)}[ H(Sj | SK) ] + Hm(S)
         ≥ (n − m) · Hm(S)/m + Hm(S)    (Proposition 5.7)
         = (n/m) · Hm(S).
Finally, we prove Lemma 5.2.

Proof of Lemma 5.2. We bound,

    I(S; y) = H(S) − H(S | y)    (Definition 8)
            = (n/m) · Hm(S) − H(S | y)    (Lemma 5.5, S is product)
            ≤ (n/m) · Hm(S) − (n/m) · Hm(S | y)    (Lemma 5.6)

which is exactly the desired result.

6 Bounding the stability of subsampling queries


The goal of this section is to prove the following result.

Lemma 6.1. For any function φ : X^w → Y and m ≤ n, the randomized algorithm which maps
S ∈ X^n to φ(S) is (m, ε)-ALMOKL stable for

    ε := (|Y| − 1) / ( ⌊(n − m)/w⌋ + 1 ) ≤ w(|Y| − 1) / (n − m + 1).

Before proving Lemma 6.1, we show how it gives a bound on the mutual information between
the sample and query responses.

Theorem 13 (Theorem 2 restated). For any distribution D over domain X and analyst that makes
a series of adaptive queries φ1 : X^{w1} → Y1, . . . , φT : X^{wT} → YT to a sample S ∼ D^n and receives
responses y := (y1, . . . , yT),

    I(S; y) ≤ ( 4 E[Σ_{t∈[T]} wt|Yt|] ) / n

where the expectation is both over the sample S ∼ D^n and the analyst's decisions.

Proof of Theorem 13 assuming Lemma 6.1. We set m := ⌈n/2⌉. By Lemma 6.1, the m-ALMOKL
stability of a subsampling query φ : X^w → Y is at most w(|Y| − 1)/(n − m + 1) ≤ 2w(|Y| − 1)/n. The
desired result follows from Theorem 12 and the fact that n/m ≤ 2.

To prove Lemma 6.1, we need to design a randomized M′φ for which

    E_{J∼([n] choose n−m)}[ dKL( φ(S) ∥ M′φ(SJ) ) ] ≤ ε.

A natural choice for M′φ is to make it perform the subsampling query φ, meaning it outputs y with
probability Pr_{x1,...,xw∼(SJ)^{(w)}}[φ(x1, . . . , xw) = y]. Unfortunately, this natural choice can lead to an
infinite KL divergence, even in the simplest case where w = 1. If S contains only a single point
that evaluates to y, then with nonzero probability SJ will contain no points that evaluate to y. In
that case, the support of φ(SJ) will be strictly smaller than that of φ(S), leading to an infinite KL
divergence.
    To avoid this, we smooth M′φ to ensure it has a nonzero probability of outputting every point
in Y. Specifically, we set,

    M′φ(SJ) := Unif(Y) with probability α, and φ(SJ) with probability 1 − α,    where α := |Y| / ( |Y| + ⌊(n − m)/w⌋ ).    (7)
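A minimal sketch of Equation (7) in code (function and variable names are ours; the paper specifies only the distribution of M′φ, not an implementation):

```python
import random

def smoothed_mechanism(S_J, phi, w, Y, n, m, rng):
    """M'_phi from Equation (7): output Unif(Y) with probability alpha,
    else apply phi to a w-subsample of S_J drawn without replacement."""
    ell = (n - m) // w
    alpha = len(Y) / (len(Y) + ell)
    if rng.random() < alpha:
        return rng.choice(sorted(Y))      # Unif(Y)
    return phi(*rng.sample(S_J, w))       # the subsampling query phi(S_J)

rng = random.Random(0)
n, m, w = 100, 50, 2
S_J = [rng.randint(0, 9) for _ in range(n - m)]  # the retained n - m points
y = smoothed_mechanism(S_J, lambda a, b: int(a > b), w, Y={0, 1}, n=n, m=m, rng=rng)
assert y in {0, 1}
```

Because every y ∈ Y is output with probability at least α/|Y| > 0, the KL divergence above is always finite.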

Our choice of α is purely for technical convenience. As we will see in the proof of Lemma 6.1,
with this choice of α we are able to use the following inverse moment bound for the binomial
distribution.

Fact 6.2 (Inverse expectation of a binomial random variable). For any n ∈ N and p ∈ (0, 1),

    E_{k∼Bin(n,p)}[ 1/(k + 1) ] = ( 1 − (1 − p)^{n+1} ) / ( p(n + 1) ) ≤ 1 / ( p(n + 1) ).

Proof. We expand the desired expectation,

    E[ 1/(k + 1) ] = Σ_{k=0}^{n} (n choose k) p^k (1 − p)^{n−k} / (k + 1)
                   = 1/(n + 1) · Σ_{k=0}^{n} (n + 1 choose k + 1) p^k (1 − p)^{n−k}
                   = 1/(p(n + 1)) · Σ_{k=0}^{n} (n + 1 choose k + 1) p^{k+1} (1 − p)^{(n+1)−(k+1)}
                   = 1/(p(n + 1)) · Pr_{k′∼Bin(n+1,p)}[ k′ ≠ 0 ]
                   = ( 1 − (1 − p)^{n+1} ) / ( p(n + 1) ).
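Fact 6.2's closed form can be verified exactly with rational arithmetic (a small check of ours, not from the paper):

```python
from fractions import Fraction
from math import comb

def inv_moment(n, p):
    """E_{k ~ Bin(n, p)}[1 / (k + 1)], computed exactly by summing the pmf."""
    return sum(
        Fraction(comb(n, k)) * p**k * (1 - p)**(n - k) / (k + 1)
        for k in range(n + 1)
    )

n, p = 7, Fraction(1, 3)
closed_form = (1 - (1 - p)**(n + 1)) / (p * (n + 1))
assert inv_moment(n, p) == closed_form       # the equality of Fact 6.2
assert closed_form <= 1 / (p * (n + 1))      # the stated upper bound
```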
Other choices of α in Equation (7) would correspond to different inverse moments of the binomial,
which require a more involved analysis than the above simple fact (see e.g. [CS72]). Our proof of
Lemma 6.1 uses our generalization of Hoeffding’s reduction theorem, Theorem 7, which we prove in
Section 6.1.

Proof of Lemma 6.1. We use the mechanism M′φ defined in Equation (7). Our task is to bound,
for any sample S ∈ X^n,

    E_{J∼([n] choose n−m)}[ dKL( φ(S) ∥ M′φ(SJ) ) ] = E_{J∼([n] choose n−m)}[ E_{y∼φ(S)}[ ln( Pr[φ(S) = y] / Pr[M′φ(SJ) = y] ) ] ].

For each y ∈ Y and J ∈ ([n] choose n−m), we'll define qy(SJ) := Pr[φ(SJ) = y]. Then, for ℓ := ⌊(n − m)/w⌋,

    Pr[M′φ(SJ) = y] = 1/(ℓ + |Y|) + ℓ/(ℓ + |Y|) · qy(SJ) = ( ℓ · qy(SJ) + 1 ) / ( ℓ + |Y| ).

We'll also use py(S) as shorthand for Pr[φ(S) = y]. With this notation,

    E_{J∼([n] choose n−m)}[ dKL( φ(S) ∥ M′φ(SJ) ) ]
        = E_{y∼φ(S)}[ E_{J∼([n] choose n−m)}[ ln( py(S) / ( (ℓ · qy(SJ) + 1)/(ℓ + |Y|) ) ) ] ]
        = E_{y∼φ(S)}[ E_{J∼([n] choose n−m)}[ ln( py(S)(ℓ + |Y|) / (ℓ · qy(SJ) + 1) ) ] ]
        ≤ E_{y∼φ(S)}[ E_{z∼Bin(ℓ,py(S))}[ ln( py(S)(ℓ + |Y|) / (z + 1) ) ] ]    (Theorem 7)
        ≤ E_{y∼φ(S)}[ E_{z∼Bin(ℓ,py(S))}[ py(S)(ℓ + |Y|)/(z + 1) − 1 ] ]    (ln x ≤ x − 1)
        ≤ E_{y∼φ(S)}[ py(S)(ℓ + |Y|) / ( py(S)(ℓ + 1) ) − 1 ]    (Fact 6.2)
        = (|Y| − 1) / (ℓ + 1)

which is exactly the desired bound based on our definition of ℓ := ⌊(n − m)/w⌋.

6.1 Improving Hoeffding’s reduction theorem
In this section, we prove Theorem 7.

Theorem 14 (Theorem 7 restated). Consider any finite population X ∈ R^n and φ : X^w → R. For
any convex f : R → R and integer m ≤ n,

    E_{J∼([n] choose m)}[ f(µφ(XJ)) ] ≤ E_{T1,...,Tk ∼ X^{(w)} i.i.d.}[ f( (1/k) Σ_{i∈[k]} φ(Ti) ) ]

where k = ⌊m/w⌋ and T ∼ X^{(w)} indicates that T contains w elements from X sampled uniformly
without replacement.
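Theorem 14 can be sanity-checked by exact enumeration on a tiny population (our code; the kernel, population, and convex choice f(x) = x² are arbitrary):

```python
from itertools import combinations, permutations, product
from statistics import mean

X = (0, 1, 1, 3, 5)       # a small finite population, n = 5
n, w, m = len(X), 2, 4
k = m // w                # k = floor(m/w) = 2
phi = lambda a, b: a * b  # an arbitrary kernel phi : X^2 -> R
f = lambda x: x * x       # an arbitrary convex f

# Left side: E_J[f(mu_phi(X_J))] over all size-m index sets J, where
# mu_phi averages phi over ordered w-tuples of X_J without replacement.
lhs = mean(
    f(mean(phi(*t) for t in permutations([X[i] for i in J], w)))
    for J in combinations(range(n), m)
)

# Right side: E[f((1/k) * sum_i phi(T_i))] over k iid tuples T_i ~ X^(w).
tuples = list(permutations(range(n), w))  # ordered index pairs, no repeats
rhs = mean(
    f(mean(phi(X[a], X[b]) for a, b in T))
    for T in product(tuples, repeat=k)
)
assert lhs <= rhs + 1e-12  # the inequality of Theorem 14
```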

Proof. Without loss of generality, and to ease notation, we will assume that X = [n]. Otherwise,
for X = (x1, . . . , xn), we could consider the kernel ψ(i1, . . . , iw) := φ(xi1, . . . , xiw), because the
U-statistic with kernel φ and population X has the same distribution as that with kernel ψ and
population [n].
    We draw T1, . . . , Tk ∼ [n]^{(w)} i.i.d. and note that at most wk ≤ m unique elements are contained in
T1 ∪ · · · ∪ Tk. Then, we draw J uniformly from ([n] choose m) conditioned on all of those unique elements
being contained in J. By symmetry, we observe the following:

1. Marginally (i.e. not conditioning on the values of T1, . . . , Tk), the distribution of J is uniform
   over ([n] choose m).

2. Conditioning on any choice of J = J, the distribution of Ti is uniform over J^{(w)} for every
   i ∈ [k].
We proceed to bound,

    E_J[ f(µφ(XJ)) ] = E_J[ f(µφ(J)) ]    (WLOG, X = [n])
                     = E_J[ f( E_{T′∼J^{(w)}}[φ(T′)] ) ]    (Definition of µφ(J))
                     = E_J[ f( E_{Ti|J}[φ(Ti)] ) ] for any i ∈ [k]    (Ti | J is uniform over J^{(w)})
                     = E_J[ f( E_{T|J}[ (1/k) Σ_{i∈[k]} φ(Ti) ] ) ]    (Linearity of expectation)
                     ≤ E_J[ E_{T|J}[ f( (1/k) Σ_{i∈[k]} φ(Ti) ) ] ]    (Jensen's inequality)
                     = E_T[ f( (1/k) Σ_{i∈[k]} φ(Ti) ) ],    (Law of total expectation)

which, since we drew T1, . . . , Tk ∼ [n]^{(w)} i.i.d., is exactly the desired result.

7 From small mutual information to small bias
In this section, we connect the mutual information bound of Theorem 2 to generalization error,
completing the proofs of Theorems 1, 8 and 9. Recall our definition of error for a test query.

Definition 12 (Definition 5, restated). For any ψ : X^w → [0, 1] and distribution D over X, we
define

    error(ψ, S, D) := (1/w) · min( ∆, ∆²/σ² )

where

    ∆ := |µψ(S) − µψ(D)|    and    σ² := Var_{S∼D^w}[ψ(S)].

Theorem 15 (Mutual information bounds bias). For any S ∼ D^n and y ∈ Y, as well as Ψ(y), a
set of test queries each mapping X* → [0, 1], chosen as a function of y,

    E_{S,y}[ sup_{ψ∈Ψ(y)} error(ψ, S, D) ] ≤ ( 16·I(S; y) + 8·E_y[ln |Ψ(y)|] + 24 ln 2 ) / n.

Our proof will use the following fact.

Fact 7.1 ([FS18]). For any random variables S ∈ X and ψ : X → R ∈ Ψ, as well as λ > 0,

    E_{S,ψ}[ψ(S)] ≤ (1/λ) · ( I(S; ψ) + sup_{ψ∈Ψ} ln E_S[exp(λψ(S))] ).

Corollary 7.2. For any random variables x ∈ X and f : X → R ∈ F satisfying, for all t ≥ 0 and
f ∈ F, Pr_x[f(x) ≥ t] ≤ e^{−t},

    E[f(x)] ≤ 2( I(x; f) + ln 2 ).

Proof. Fix an arbitrary f ∈ F. We first bound the moment generating function of f(x).

    E[exp(f(x)/2)] = ∫₀^∞ Pr[exp(f(x)/2) ≥ t] dt
                   = 1 + ∫₁^∞ Pr[f(x) ≥ 2 ln t] dt
                   ≤ 1 + ∫₁^∞ t^{−2} dt = 2.

The desired result follows from Fact 7.1 with λ = 1/2.

Bernstein's inequality will give the sort of high-probability bounds needed to apply Corollary 7.2.

Fact 7.3 (Bernstein's inequality). For any i.i.d. mean-zero random variables a1, . . . , an with variance
σ² satisfying, almost surely, |ai| ≤ 1, let A = (1/n) · Σ_{i∈[n]} ai. Then,

    Pr[|A| ≥ ∆] ≤ 2 exp( −∆²n / ( 2(σ² + ∆/3) ) ).    (8)
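As a quick numerical illustration (our code; the parameters are arbitrary), the empirical tail of such an average sits below the bound of Equation (8):

```python
import random
from math import exp

rng = random.Random(0)
n, sigma2, delta = 400, 0.25, 0.15
# Bernstein bound of Equation (8) for these parameters.
bernstein = 2 * exp(-(delta**2) * n / (2 * (sigma2 + delta / 3)))

# Empirical tail of A = (1/n) * sum a_i, for mean-zero a_i in {-1/2, +1/2}
# (variance 1/4, |a_i| <= 1 almost surely).
trials = 2000
hits = 0
for _ in range(trials):
    A = sum(rng.choice((-0.5, 0.5)) for _ in range(n)) / n
    hits += abs(A) >= delta
assert hits / trials <= bernstein
```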

For our setting, a black-box application of Bernstein’s inequality is not sufficient. We wish to
prove concentration of a random variable that is not necessarily the sum of iid random variables.
Fortunately, the proof of Bernstein’s inequality only uses a bound on the moment generating function
of A: It proceeds by applying Markov’s inequality to the random variable eλA for an appropriate
choice of λ ∈ R. As a result, the following also holds.
Fact 7.4 (Generalization of Bernstein's inequality). Let B be any random variable satisfying, for
all λ ∈ R,

    E[e^{λB}] ≤ E[e^{λA}],

where A is as in Fact 7.3. Then, the bound of Equation (8) also holds with B in place of A.
We'll use the following to produce a bound in our setting.

Proposition 7.5. For any ψ : X^w → R, distribution D over X, and convex function f : R → R,
set k = ⌊n/w⌋. Then,

    E_{S∼D^n}[ f(µψ(S)) ] ≤ E_{T1,...,Tk ∼ D^w i.i.d.}[ f( E_{i∼[k]}[ψ(Ti)] ) ].

Proposition 7.5 is "almost" a simple corollary of Theorem 7. Suppose we first sampled X ∼ D^N
for some large N and then S′ ∼ X^{(n)}. As N grows, the distribution of S′ approaches that of S.
Similarly, if we draw T1′, . . . , Tk′ ∼ X^{(w)} i.i.d., then the distribution of (T1, . . . , Tk) is equivalent to that
of (T1′, . . . , Tk′). By comparing the statement of Proposition 7.5 to that of Theorem 7, we see that
they become equivalent as N goes to infinity. To avoid dealing with this limiting behavior, we
instead give a direct proof of Proposition 7.5.
Proof. We'll couple the distributions of T1, . . . , Tk and S. First, draw T1, . . . , Tk ∼ D^w i.i.d. and
Tk+1 ∼ D^{n−kw}. Then concatenate these (k + 1) groups of points and uniformly permute all n
points to form S. We observe that

1. The distribution of S is D^n, as it is in the statement of Proposition 7.5.

2. Conditioning on the value of S = S, the distribution of Ti is uniform over S^{(w)} for any i ∈ [k].

Finally, we bound,

    E_{S∼D^n}[ f(µψ(S)) ] = E_{S∼D^n}[ f( E_{T|S}[ψ(Ti)] ) ] for any i ∈ [k]    (Ti uniform in S^{(w)})
                          = E_{S∼D^n}[ f( E_{T|S}[ E_{i∼[k]}[ψ(Ti)] ] ) ]    (Linearity of expectation)
                          ≤ E_{S∼D^n}[ E_{T|S}[ f( E_{i∼[k]}[ψ(Ti)] ) ] ]    (Jensen's inequality)
                          = E_{T1,...,Tk ∼ D^w i.i.d.}[ f( E_{i∼[k]}[ψ(Ti)] ) ].    (Law of total expectation)
Since x ↦ exp(λx) is convex, combining Proposition 7.5 with Fact 7.4, we arrive at the following corollary.

Corollary 7.6. For any ψ : X^w → [−1, 1] and distribution D over X such that E_{S∼D^w}[ψ(S)] = 0 and Var_{S∼D^w}[ψ(S)] = σ²,

    Pr_{S∼Dⁿ}[|µψ(S)| ≥ ∆] ≤ 2 exp( −∆²k / (2(σ² + ∆/3)) ) ≤ 2 exp( −(k/4) · min(∆, ∆²/σ²) ),

where k := ⌊n/w⌋.
Finally, we complete the proof of Theorem 15.
Proof of Theorem 15. For each y ∈ Y, we define f^(y) : Xⁿ → ℝ as

    f^(y)(S) := sup_{ψ∈Ψ^(y)} [ n · error(ψ, S, D)/8 ] − ln(2|Ψ^(y)|).

We claim that for all y ∈ Y and t > 0, Pr[f^(y)(S) ≥ t] ≤ e^{−t}. First, consider a single test function ψ : X^w → [0, 1]. By Corollary 7.6 applied to the centered query ψ′ := ψ − ψ(D), as well as the bound n/(2w) ≤ k when k := ⌊n/w⌋,

    Pr_{S∼Dⁿ}[error(ψ, S, D) · n/8 ≥ ε] ≤ 2 exp(−ε).

By the union bound,

    Pr_S[f^(y)(S) ≥ t] ≤ Σ_{ψ∈Ψ^(y)} Pr_S[ n · error(ψ, S, D)/8 − ln(2|Ψ^(y)|) ≥ t ]
                      ≤ |Ψ^(y)| · 2e^{−t − ln(2|Ψ^(y)|)} = e^{−t}.
Therefore, by Corollary 7.2,

    E_{S,y}[ sup_{ψ∈Ψ^(y)} n · error(ψ, S, D)/8 − ln(2|Ψ^(y)|) ] ≤ 2(I(S; y) + ln 2).

Rearranging yields

    E_{S,y}[ sup_{ψ∈Ψ^(y)} error(ψ, S, D) ] ≤ ( 16 I(S; y) + 8 E_y[ln|Ψ^(y)|] + 24 ln 2 ) / n.
Theorem 9 is a direct consequence of Theorems 13 and 15. For completeness, we show how
Theorem 9 implies Theorem 8.
Proof of Theorem 8 from Theorem 9. For each t ∈ [T] and y ∈ Y, we design a test query ψ_{t,y} : X^w → [0, 1] that maps (x₁, . . . , x_w) to 1[φₜ(x₁, . . . , x_w) = y], for a total of m = T|Y| test queries. Using ∆_{t,y} and σ²_{t,y} as shorthand for φₜ(S)(y) − φₜ(D)(y) and Var_{ψ_{t,y}}(D) respectively,

    E[ sup_{t∈[T],y∈Y} ∆²_{t,y} ] = E[ sup_{t∈[T],y∈Y} min(∆_{t,y}, ∆²_{t,y}) ]    (x ≥ x² for x ∈ [0, 1])
                                 ≤ E[ sup_{t∈[T],y∈Y} min(∆_{t,y}, ∆²_{t,y}/σ²_{t,y}) ]    (σ²_{t,y} ≤ 1)
                                 = E[ sup_{t∈[T],y∈Y} w · error(ψ_{t,y}, S, D) ].    (Definition 12)

The desired result follows from Theorem 9.

8 Boosting the success probability
Theorem 9 proves that, with constant success probability, the analyst cannot find a test on which the sample is biased. In this section, we prove Theorem 10, showing that, when all of the analyst's queries φ : X^w → Y satisfy w = 1, that guarantee holds with high probability. We do this via a reduction from small failure probability to constant failure probability.

Notation Throughout this section, we will only consider analysts who make queries of the form
φ : X → Y and tests of the form ψ : X → [0, 1] (corresponding to w = 1). For now, we will also
restrict to analysts that only output a single test. Given an analyst A and sample S ∈ X n , we’ll use
the notation A(S) as shorthand for the distribution of tests A asks on the sample S; i.e., ψ ∼ A(S)
is shorthand for, at each t ∈ [T ], setting φt = A(y1 , . . . , yt−1 ) and yt ∼ φt (S), and then setting
ψ = A(y1 , . . . , yT ).

Definition 13 (Budgeted analyst). We say an analyst A is (cost, b)-budgeted if, for any sample S, the queries A(S) asks, φ₁, . . . , φ_T, satisfy Σ_{t∈[T]} cost(φₜ) ≤ b almost surely.

Lemma 8.1 (Boosting from constant to small failure probability). For any distribution D over X, sample size n, cost function cost, budget b, and threshold τψ ∈ [0, 1] for each ψ : X → [0, 1], suppose that for all analysts A that are (cost, b)-budgeted,

    Pr_{S∼Dⁿ, ψ∼A(S)}[µψ(S) ≥ τψ] ≤ 1/100.

Then, for any sample size N ≥ n, k = ⌊N/n⌋, and all analysts A′ that are (cost, B := bk/100)-budgeted,

    Pr_{S∼D^N, ψ∼A′(S)}[µψ(S) ≥ τψ + 1/n] ≤ exp(−Ω(k)).

At a high level, the proof of Lemma 8.1 exploits a classic and well-known technique for boosting
success probabilities: Given an algorithm that fails with constant probability, if we run k independent
copies, with probability 1 − 2−Ω(k) , a large fraction of those copies will succeed. Typically, this
technique gives a framework for modifying existing algorithms – for example, by taking the majority
of multiple independent runs – in order to produce a new algorithm with a small failure probability.
Interestingly, in our setting, no modification to the algorithm is necessary. If we answer subsampling queries in the most natural way, they "automatically" boost their own success probability. The key insight, as depicted in Figure 3, is that a single large sample S ∼ D^N with N := nk can be cast as k groups of samples S₁, . . . , Sₖ ∼iid Dⁿ, and the output of a query y ∼ φ(S) is distributed identically to that of y′ ∼ φ(Sᵢ) where i ∼ [k] is a uniformly random group. Using this insight, we are able to prove the following.
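For w = 1, this group decomposition can be confirmed by exact enumeration. The following toy sketch (with arbitrary values; not the paper's mechanism) checks that one uniform draw from the full sample S has the same distribution as a uniform draw from a uniformly chosen group Sᵢ:

```python
from fractions import Fraction
from collections import Counter

def uniform_dist(sample):
    """Exact distribution of one uniformly subsampled element (a w = 1 query)."""
    d = Counter()
    for x in sample:
        d[x] += Fraction(1, len(sample))
    return d

n, k = 3, 4
S = [10, 11, 12, 20, 21, 22, 30, 31, 32, 40, 41, 42]  # N = n*k points
groups = [S[i * n:(i + 1) * n] for i in range(k)]

# y ~ phi(S): a uniform element of the full sample.
full = uniform_dist(S)

# y' ~ phi(S_i) for i ~ [k]: a (1/k)-weighted mixture of per-group distributions.
mixture = Counter()
for g in groups:
    for x, p in uniform_dist(g).items():
        mixture[x] += p * Fraction(1, k)

assert full == mixture  # the two sampling procedures agree exactly
```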

Lemma 8.2 (It is exponentially unlikely that many groups have high error). In the setting of Lemma 8.1, let S₁ be the first n elements of S, S₂ the next n, and so on. Then,

    Pr_{S∼D^N, ψ∼A′(S)}[ Σ_{i∈[k]} 1[µψ(Sᵢ) ≥ τψ] ≥ 0.03k ] ≤ e^{−k/300}.

The hypothesis of Lemma 8.1 roughly corresponds to Pr[µψ (Si ) ≥ τψ ] ≤ 1/100 for each i ∈ [k].
If these events were independent for each i ∈ [k], then the conclusion of Lemma 8.2 would follow
from a standard Chernoff bound. Unfortunately, they are not necessarily independent. To get around that, in Lemma 8.4 we extend a direct product theorem of Shaltiel's, showing, roughly speaking, that those events are no worse than independent.
We combine Lemma 8.2 with the following.
Lemma 8.3. For N ≥ nk, let S ∈ X^N be drawn from any permutation-invariant distribution, meaning Pr[S = S] = Pr[S = σ(S)] for any permutation σ of the indices. Let S₁ be the first n elements of S, S₂ the next n, and so on. For any test ψ : X → [0, 1] and threshold τ, let b be the random variable counting the number of i ∈ [k] for which µψ(Sᵢ) ≥ τ. Then,

    Pr[µψ(S) ≥ τ + 1/n] ≤ 200 Pr[b ≥ 0.03k].

Lemma 8.1 is a straightforward consequence of the above two lemmas.

Proof of Lemma 8.1 assuming Lemmas 8.2 and 8.3. The analyst chooses a test query ψ^(y) as a function of the responses y := (y₁ ∼ φ₁(S), . . . , y_T ∼ φ_T(S)). Our goal is to show that Pr[µ_{ψ^(y)}(S) ≥ τ_{ψ^(y)} + 1/n] ≤ 200 exp(−k/300).
    Conditioned on any possible sequence of responses y = y, the distribution of S is permutation invariant. Therefore,

    Pr[µ_{ψ^(y)}(S) ≥ τ_{ψ^(y)} + 1/n] = E_y[ Pr_{S|y}[ µ_{ψ^(y)}(S) ≥ τ_{ψ^(y)} + 1/n ] ]
        ≤ E_y[ 200 Pr_{S|y}[ Σ_{i∈[k]} 1[µ_{ψ^(y)}(Sᵢ) ≥ τ_{ψ^(y)}] ≥ 0.03k ] ]    (Lemma 8.3)
        = 200 Pr_{S∼D^N, ψ∼A′(S)}[ Σ_{i∈[k]} 1[µψ(Sᵢ) ≥ τψ] ≥ 0.03k ]
        ≤ 200 e^{−k/300}.    (Lemma 8.2)

8.1 Proof of Lemma 8.2


In order to prove Lemma 8.2, in Figure 5 we will define an "analyst game" formalizing the setting in which an analyst has multiple distinct samples, each of which they can ask queries to. We then prove a direct product theorem analogous to Shaltiel's direct product theorem for fair decision trees [Sha04].
Lemma 8.4 (Direct product theorem for analyst games). In Figure 5, fix the domain X, query class Φ, test class Ψ, and cost function cost for which inf_{φ∈Φ}(cost(φ)) > 0. For any distribution D := D₁ × · · · × Dₖ, each Dᵢ supported on X, and budgets b ∈ ℝᵏ, let AG(D, b) be the maximum probability an analyst wins the game described in Figure 5. Then,

    AG(D, b) ≤ ∏_{i∈[k]} AG(Dᵢ, bᵢ).    (9)

Parameters: A product distribution D := D1 × · · · × Dk over domain X, budgets b ∈ Rk ,
a class of queries Φ each mapping X to a distribution of possible responses, function
cost : Φ → R≥0 , and class of tests Ψ each mapping X to {0, 1}.

Setup: Samples x1 , . . . , xk ∼ D are drawn and not revealed to the analyst.

Execution: The analyst repeats as many times as desired:


1. The analyst chooses a query φ ∈ Φ and index i ∈ [k].
2. The analyst receives the response y ∼ φ(xi ).
3. The budget is decremented: bi ← bi − cost(φ).
Afterwards, the analyst chooses tests ψ1 , . . . , ψk . The analyst wins if bi ≥ 0 and ψi (xi ) = 1
for all i ∈ [k].

Figure 5: An analyst game.

It’s straightforward to see that the (≥) direction of Equation (9) also holds, but we will not need
that direction in this paper.

Proof. First note that if, for any i ∈ [k], bᵢ < 0, then both sides of Equation (9) are equal to 0. We may therefore assume, without loss of generality, that bᵢ ≥ 0 for all i ∈ [k].
    Consider an arbitrary analyst A for distribution D and budget b. We will prove that the probability A wins is at most ∏_{i∈[k]} AG(Dᵢ, bᵢ) by induction on the number of queries the analyst makes.⁷ In the base case, A makes zero queries and directly chooses tests. Then, A's probability of winning is upper bounded by

    sup_{ψ₁,...,ψₖ∈Ψ} Pr_{x∼D}[ ⋂_{i∈[k]} ψᵢ(xᵢ) = 1 ] = sup_{ψ₁,...,ψₖ∈Ψ} ∏_{i∈[k]} Pr_{x∼D}[ψᵢ(xᵢ) = 1]    (D is product)
        = ∏_{i∈[k]} sup_{ψᵢ∈Ψ} Pr_{xᵢ∼Dᵢ}[ψᵢ(xᵢ) = 1] ≤ ∏_{i∈[k]} AG(Dᵢ, bᵢ).

In the case where A executes at least one query, let φ ∈ Φ and i ∈ [k] be the query and group, respectively, that A chooses first. Using b′ ∈ ℝᵏ as the vector satisfying b′ᵢ = bᵢ − cost(φ) and b′ⱼ = bⱼ for all j ≠ i, the success probability of A is upper bounded by

    E_{x∼D, y∼φ(xᵢ)}[ AG(D | φ(xᵢ) = y, b′) ]
      ≤ E_{x∼D, y∼φ(xᵢ)}[ ∏_{j∈[k]} AG((D | φ(xᵢ) = y)ⱼ, b′ⱼ) ]    (inductive hypothesis)
      = E_{x∼D, y∼φ(xᵢ)}[ AG((D | φ(xᵢ) = y)ᵢ, bᵢ − cost(φ)) · ∏_{j≠i} AG(Dⱼ, bⱼ) ]
      = ∏_{j≠i} AG(Dⱼ, bⱼ) · E_{x∼D, y∼φ(xᵢ)}[ AG((D | φ(xᵢ) = y)ᵢ, bᵢ − cost(φ)) ].

The quantity E_{x∼D, y∼φ(xᵢ)}[AG((D | φ(xᵢ) = y)ᵢ, bᵢ − cost(φ))] is exactly the win probability, on Dᵢ with budget bᵢ, of the analyst whose first query is φ and whose remaining strategy is optimal. Therefore, it is upper bounded by AG(Dᵢ, bᵢ), and so the win probability of A is upper bounded by ∏_{i∈[k]} AG(Dᵢ, bᵢ), as desired.

⁷ Note that since inf_{φ∈Φ}(cost(φ)) > 0, the number of queries the analyst makes must be finite. Strictly speaking, we must also assume that the analyst stops making queries once one of the budgets is negative, which we can clearly assume without loss of generality.

We’ll also use the following version of the classic Chernoff bound.
Fact 8.5 (Chernoff bound). Let x1 , . . . , xk be random variables each taking on values in {0, 1} and
satisfying, for some p < 1 and all S ⊆ [k],
" #
\
Pr xi ≤ p|S| .
i∈S

Then, for any δ > 0,  


 2 
X δ pk
Pr xi ≥ (1 + δ)pk  ≤ exp − .
2+δ
i∈[k]

Proof of Lemma 8.2. Let S_{k+1} be the remaining N − nk elements not in S₁, . . . , Sₖ. At each time step t ∈ [T], the analyst asks a query φₜ : X → Yₜ and receives the response yₜ = φₜ(xₜ). We can think of xₜ as being chosen via a two-step process: First, a group iₜ ∈ [k + 1] is chosen, and second, xₜ is chosen uniformly from S_{iₜ}. We will show that Lemma 8.2 holds even conditioned on any realization of the groups i₁, . . . , i_T.
    For each group i ∈ [k], let aᵢ be the indicator that the total cost of queries to Sᵢ is at most b and µψ(Sᵢ) ≥ τψ. Since there is a total of B = bk/100 budget, regardless of how the analyst partitions the queries among the groups, at most k/100 groups will receive queries with total cost ≥ b. It is therefore sufficient to bound the probability that Σ_{i∈[k]} aᵢ ≥ 0.02k. Applying Lemma 8.4 and the hypothesis of Lemma 8.1, for any set of groups G ⊆ [k],

    Pr[ ⋂_{i∈G} aᵢ ] ≤ (1/100)^{|G|}.

Applying the Chernoff bound in Fact 8.5,

    Pr[ Σ_{i∈[k]} aᵢ ≥ 0.02k ] ≤ exp(−k/300).

8.2 Proof of Lemma 8.3
We begin by proving the following technical lemma.

Lemma 8.6. For any S ∈ [0, 1]^N and n < N, let x be sampled by taking the sum of n elements from S chosen uniformly without replacement. Then,

    Pr[x > E[x] − 1] ≥ (2√3 − 3)/13 > 0.0357.
We conjecture that the probability x exceeds E[x] − 1 is actually lower bounded by 1/2; such a result is known if, for example, x were drawn from a binomial distribution [Neu66, KB80] or hypergeometric distribution [Sie01]. For our purposes, any constant probability suffices, and we are therefore able to use a quantitatively weaker bound that only uses the second and fourth moments of x.

Fact 8.7 ([Ver08]). For any mean-0 random variable x with nonzero variance,

    Pr[x > 0] ≥ (2√3 − 3) · E[x²]² / E[x⁴].

By Jensen's inequality, E[x⁴]/E[x²]² is at least 1. For that ratio to be small, x must concentrate well. The intuition behind Fact 8.7 is that for x to rarely exceed its mean, it must be strongly skewed to one side, which will lead to the ratio of moments being large. Next, we bound this ratio for the particular x we care about. To make the arithmetic simpler, we compare to the setting where the sampling is performed with replacement, using Hoeffding's reduction theorem (Theorem 6).

Proposition 8.8. For any S ∈ ℝ^N with elements summing to 0 and n ≤ N, let x be the sum of n elements sampled uniformly without replacement from S, and y be the sum of n elements sampled uniformly with replacement from S. Then,

    E[x⁴]/E[x²]² ≤ ( (N − 1)/(N − n) )² · E[y⁴]/E[y²]².

Proof. By Hoeffding’s reduction theorem (Theorem 6), E[x4 ] ≤ E[y 4 ]. Therefore it suffices to show
−n
that E[x2 ] = E[y 2 ] · NN −1 .
Let x1 , . . . , xn and y1 , . . . , yn be the n elements of S chosen without replacement and with
replacement respectively. Then, since the yi are mean-0 and independent, for i ̸= j ∈ [n], E[yi yj ] = 0.
Meanwhile, using the fact that the sum of the elements in S is 0,

    E[xᵢxⱼ] = (1/(N(N−1))) Σ_{a∈[N]} Σ_{b∈[N]∖{a}} S_a S_b = (1/(N(N−1))) Σ_{a∈[N]} S_a ( Σ_{b∈[N]} S_b − S_a ) = −(1/(N(N−1))) Σ_{a∈[N]} S_a².

Furthermore, we have, for any i ∈ [n], that

    E[xᵢ²] = E[yᵢ²] = (1/N) Σ_{a∈[N]} S_a².

Next, we can compute the second moment of y:

    E[y²] = Σ_{i∈[n]} E[yᵢ²] = (n/N) Σ_{a∈[N]} S_a².

For x,

    E[x²] = Σ_{i∈[n]} E[xᵢ²] + Σ_{i≠j} E[xᵢxⱼ] = (n/N) Σ_{a∈[N]} S_a² − (n(n−1)/(N(N−1))) Σ_{a∈[N]} S_a² = (n/N)(1 − (n−1)/(N−1)) Σ_{a∈[N]} S_a².

Comparing the above gives the desired bound.
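The moment identities in this proof can be confirmed by exhaustive enumeration over a small, hand-picked centered sample (a sanity check, separate from the argument):

```python
from fractions import Fraction
from itertools import combinations, product

# A small sample whose elements sum to 0, as in Proposition 8.8.
S = [Fraction(v) for v in (-3, -1, 0, 1, 3)]
N, n = len(S), 2
assert sum(S) == 0

# x = sum of n elements drawn uniformly WITHOUT replacement.
subsets = list(combinations(S, n))
Ex2 = sum(sum(c) ** 2 for c in subsets) / len(subsets)
Ex4 = sum(sum(c) ** 4 for c in subsets) / len(subsets)

# y = sum of n elements drawn uniformly WITH replacement.
tuples = list(product(S, repeat=n))
Ey2 = sum(sum(t) ** 2 for t in tuples) / len(tuples)
Ey4 = sum(sum(t) ** 4 for t in tuples) / len(tuples)

assert Ex2 == Ey2 * Fraction(N - n, N - 1)  # the identity shown in the proof
assert Ex4 <= Ey4                           # Hoeffding's reduction theorem step
```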

Next, we bound the desired moment ratio in the setting where the elements are sampled with replacement.

Proposition 8.9. Let y₁, . . . , yₙ be iid mean-0 variables, each bounded on [−1, 1]. Then, for y := Σ_{i∈[n]} yᵢ,

    E[y⁴]/E[y²]² ≤ 3 + 1/Var[y].
Proof. We’ll denote σ 2 := E[yi2 ]. Since yi is bounded, we further have that E[yi4 ] ≤ σ 2 . Then,

E[y 2 ] = nσ 2 .

Expanding E[y 4 ] gives many terms. Luckily, since the yi are each independent and mean-0, most of
those terms cancel. We are left with,

E[y 4 ] = n E[yi4 ] + 3n(n − 1)(σ 2 )2 ≤ nσ 2 + 3n2 σ 4 .

This gives
E[y 4 ] nσ 2 + 3n2 σ 4 1
2 2
≤ 2 4
=3+ .
E[y ] n σ nσ 2
The desired result follows from Var[y] = nσ 2 .
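The exact expansion E[y⁴] = n E[yᵢ⁴] + 3n(n − 1)σ⁴ underlying this proof can likewise be checked by enumeration; the symmetric three-point distribution below is an arbitrary example of an iid mean-0 family bounded on [−1, 1]:

```python
from fractions import Fraction
from itertools import product

q = Fraction(1, 3)                       # P(y_i = 1) = P(y_i = -1) = q
pmf = {-1: q, 0: 1 - 2 * q, 1: q}        # mean 0, bounded on [-1, 1]
sigma2 = sum(p * v * v for v, p in pmf.items())
m4 = sum(p * v ** 4 for v, p in pmf.items())

n = 4
Ey2 = Ey4 = Fraction(0)
for outcome in product(pmf, repeat=n):   # enumerate all 3^n outcomes of (y_1..y_n)
    prob = Fraction(1)
    for v in outcome:
        prob *= pmf[v]
    s = sum(outcome)
    Ey2 += prob * s ** 2
    Ey4 += prob * s ** 4

assert Ey2 == n * sigma2                              # E[y^2] = n * sigma^2
assert Ey4 == n * m4 + 3 * n * (n - 1) * sigma2 ** 2  # only these terms survive
assert Ey4 / Ey2 ** 2 <= 3 + 1 / (n * sigma2)         # Proposition 8.9's bound
```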

By combining the above, we have the following.

Lemma 8.10. For any S ∈ [0, 1]^N and n < N, let x be sampled by taking the sum of n elements from S chosen uniformly without replacement. Then, if Var[x] is nonzero,

    Pr[x > E[x]] ≥ (2√3 − 3) / (12 + 4/Var[x]).

Proof. Let µ := (1/N) Σ_{x∈S} x, and let S′ be a copy of S with each element shifted by −µ. Clearly, each element of S′ is bounded on [−1, 1] and the elements sum to 0, so it suffices to bound Pr[x′ > 0] for x′ the sum of n uniform elements chosen without replacement from S′. Furthermore, without loss of generality, we may assume that n ≤ N/2: if n > N/2, we may instead consider the sum of the elements not sampled, as x′ > 0 iff the sum of the elements not sampled is negative.
Let y′ be the sum of n elements chosen uniformly with replacement from S′. Then,

    Pr[x > E[x]] = Pr[x′ > 0]
        ≥ (2√3 − 3) · E[x′²]² / E[x′⁴]    (Fact 8.7)
        ≥ (2√3 − 3) / ( ((N−1)/(N−n))² · E[y′⁴]/E[y′²]² )    (Proposition 8.8)
        ≥ (2√3 − 3) / ( 4 · E[y′⁴]/E[y′²]² )    (n ≤ N/2)
        ≥ (2√3 − 3) / ( 4(3 + 1/Var[y′]) )    (Proposition 8.9)
        ≥ (2√3 − 3) / ( 4(3 + 1/Var[x]) ).    (Theorem 6, as Var[x] = Var[x′] ≤ Var[y′])
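As a quick sanity check (separate from the proof), the bound of Lemma 8.10 can be verified by exhaustively enumerating all size-n subsets of a few small, hand-picked samples S ∈ [0, 1]^N:

```python
import math
from itertools import combinations

def check_lemma(S, n):
    """Verify Pr[x > E[x]] >= (2*sqrt(3) - 3)/(12 + 4/Var[x]) by enumeration."""
    sums = [sum(c) for c in combinations(S, n)]
    mean = sum(sums) / len(sums)
    var = sum((s - mean) ** 2 for s in sums) / len(sums)
    frac = sum(1 for s in sums if s > mean) / len(sums)
    bound = (2 * math.sqrt(3) - 3) / (12 + 4 / var)
    assert frac >= bound, (S, n, frac, bound)

check_lemma([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], 3)
check_lemma([0.0, 0.0, 0.1, 0.9, 1.0], 2)
check_lemma([0.05, 0.1, 0.3, 0.95, 1.0, 1.0, 0.0], 3)
```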

Lemma 8.10 handles the case where the variance is large. For the case where it is small, we use the following one-sided variant of Chebyshev's inequality.

Fact 8.11 (Cantelli's inequality). Let x be a random variable with variance σ². Then, for any ε > 0,

    Pr[x − E[x] ≥ εσ] ≤ 1/(1 + ε²).
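Cantelli's inequality in its standard form bounds the upper tail by 1/(1 + ε²), and this can be checked exactly for Bernoulli variables (an arbitrary family, for which the bound is tight at ε = √((1 − p)/p)):

```python
import math

def cantelli_holds(p, eps):
    """For x ~ Ber(p): check Pr[x - E[x] >= eps*sigma] <= 1/(1 + eps^2)."""
    sigma = math.sqrt(p * (1 - p))
    # Only x = 1 can exceed the mean; it occurs with probability p.
    upper_tail = p if (1 - p) >= eps * sigma else 0.0
    return upper_tail <= 1 / (1 + eps ** 2) + 1e-12

assert all(
    cantelli_holds(p / 100, eps / 10)
    for p in range(1, 100)
    for eps in range(1, 50)
)
```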
Finally, we prove Lemma 8.6.

Proof of Lemma 8.6. If Var[x] ≥ 4, the desired result follows from Lemma 8.10, since then (2√3 − 3)/(12 + 4/Var[x]) ≥ (2√3 − 3)/13. Otherwise, σ² := Var[x] satisfies σ² < 4, and applying Fact 8.11 to −x with ε = 1/σ gives Pr[x ≤ E[x] − 1] ≤ σ²/(1 + σ²) < 4/5, so Pr[x > E[x] − 1] > 1/5 > 0.0357.

We conclude with a proof of the main result of this section.

Proof of Lemma 8.3. We’ll use a as shorthand for 1[µψ (Si ) ≥ τ + 1/n]. Since Pr[b ≥ 0.03k] ≥
Pr[b ≥ 0.03k | a] · Pr[a],
Pr[b ≥ 0.03k]
Pr[a] ≤ .
Pr[b ≥ 0.03k | a]
1
Therefore, it suffices to show that Pr[b ≥ 0.03k | a] ≥ 200 . By Lemma 8.6, for any i ∈ [k]

Pr[µψ (Si ) ≥ τ | a] > 0.0357.

Using linearity of expectation,


E[b | a] > 0.0357k.
The random variable k − b is nonnegative and satisfies

E[k − b | a] < k − 0.0357k = 0.9643k.

35
Therefore, by Markov’s inequality
0.9643k
Pr[k − b ≥ 0.97k | a] ≤ < 0.995.
0.97k
Equivalently, b ≥ 0.03k with probability at least 0.005 conditioned on µψ (S) ≥ τ + 1/n, which is
exactly what we wished to show.

8.3 Proof of Theorem 10


Proof. The case where m ≥ 2 follows from the m = 1 case and a union bound, so we focus on the m = 1 case. Let k = O(ln(1/δ)). We note that the desired bound is vacuous if k > n, so we may assume that k ≤ n. Therefore, n′ := ⌊n/k⌋ satisfies n′ ≥ n/2k. By Theorem 9, for any analyst A′ that is (cost, 100b/k)-budgeted,

    E_{S∼D^{n′}, ψ∼A′(S)}[error(ψ, S, D)] ≤ O( b/(k(n′)²) + 1/n′ ) = O( k · (b/n² + 1/n) ).

For each possible test function ψ : X → [0, 1], we set the threshold to

    τψ := µψ(D) + O( max( kb/n² + k/n , √( (kb/n² + k/n) · Var_ψ(D) ) ) ),

so that, by Markov’s inequality and the expected error bound,


1
Pr [µψ (S) ≥ τψ ] ≤ .

S∼Dn ,ψ∼A′ (S) 100

Here, we apply Lemma 8.1:

    Pr_{S∼Dⁿ, ψ∼A(S)}[µψ(S) ≥ τψ + 2k/n] ≤ Pr_{S∼Dⁿ, ψ∼A(S)}[µψ(S) ≥ τψ + 1/n′] ≤ exp(−Ω(k)) = δ/2.

We can symmetrically bound the probability that µ_{1−ψ} is too high, which is equivalent to bounding the probability that µψ is too low. Union bounding over these two cases, we have that

    Pr_{S∼Dⁿ, ψ∼A(S)}[ error(ψ, S, D) ≥ O( k · (b/n² + 1/n) ) ] ≤ δ.

9 Applications
9.1 The statistical-query mechanism: Proof of Theorem 3
The proof of Theorem 3 combines two bounds: First, with high probability, all statistical queries
asked have low bias, meaning µφ (S) and µφ (D) are close. Second, with high probability, the answer
the mechanism gives when receiving a statistical query φ is close to µφ (S). These two results are
combined with the following proposition.

Proposition 9.1 (Triangle-like inequality). For any τ > 0 and a, b, c ∈ [0, 1] satisfying

    |b − a| ≤ max( τ², τ√(a(1−a)) )   and   |c − b| ≤ max( τ², τ√(b(1−b)) ),

it also holds that

    |c − a| ≤ 3 max( τ², τ√(a(1−a)) ).

Proof. First, consider the case where |c − b| ≤ τ². Then,

    |c − a| ≤ |a − b| + |c − b| ≤ max( τ², τ√(a(1−a)) ) + τ² ≤ 2 max( τ², τ√(a(1−a)) ).

In the other case,

    |c − b| ≤ τ√(b(1−b))
            = τ√( (a + (b−a))(1 − a − (b−a)) )
            = τ√( a(1−a) + (b−a)(1−2a) − (b−a)² )
            ≤ τ√( a(1−a) + |b−a| )
            ≤ τ√(a(1−a)) + τ√|b−a|
            ≤ τ√(a(1−a)) + τ√( max(τ², τ√(a(1−a))) )
            = τ√(a(1−a)) + max( τ², √(τ² · τ√(a(1−a))) )
            ≤ τ√(a(1−a)) + max( τ², max(τ², τ√(a(1−a))) )
            = τ√(a(1−a)) + max( τ², τ√(a(1−a)) )
            ≤ 2 max( τ², τ√(a(1−a)) ).

The desired result follows from the standard triangle inequality.
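Proposition 9.1 is also easy to spot-check numerically; the grid and τ values below are arbitrary:

```python
import math

def bnd(tau, p):
    """max(tau^2, tau * sqrt(p * (1 - p)))."""
    return max(tau * tau, tau * math.sqrt(p * (1 - p)))

grid = [i / 20 for i in range(21)]  # a coarse grid over [0, 1]
for tau in (0.05, 0.1, 0.3):
    for a in grid:
        for b in grid:
            if abs(b - a) > bnd(tau, a):
                continue  # hypothesis on |b - a| fails
            for c in grid:
                if abs(c - b) <= bnd(tau, b):
                    # both hypotheses hold, so the conclusion must as well
                    assert abs(c - a) <= 3 * bnd(tau, a) + 1e-9
```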

The difference between yₜ and φₜ(S) is bounded by the following form of the Chernoff bound.

Fact 9.2 (Chernoff bound). Let y be the mean of k independent random variables, each supported on {0, 1}, and let p := E[y]. If

    k ≥ C ln(1/δ)/τ²

for a sufficiently large constant C, then, with probability at least 1 − δ,

    |y − p| ≤ max( τ², τ√(p(1−p)) ).
Applying the above and a union bound over the T queries, we arrive at the following.
Corollary 9.3. In the setting of Theorem 3, with probability at least 1 − δ, for all t ∈ [T],

    |µφₜ(S) − yₜ| ≤ max( τ², τ√( µφₜ(S)(1 − µφₜ(S)) ) ).
We are ready to prove that our SQ mechanism is accurate with high probability. This proof will apply the "monitor" technique of Bassily, Nissim, Smith, Steinke, Stemmer, and Ullman [BNS+16] to set the test query to the SQ with the "worst" response. To select this worst query, the analyst must have some knowledge of the distribution, as they must measure the deviation between yₜ and µφₜ(D).
At first glance, this is strange: If the analyst knew D, they would already know µφₜ(D) to perfect accuracy and therefore have no need to use our mechanism. The key is that, while we assume the analyst knows D as part of the analysis, we use this to conclude that the mechanism's responses, y₁, . . . , y_T, have good accuracy. Therefore, even if the analyst did not know D, they could use the mechanism's responses to estimate the answers to their queries. We'll also mention that, from the perspective of applying Theorem 10, we are free to assume the analyst knows D, as the guarantee of Theorem 10 holds for all distributions and analysts.

Proof of Theorem 3. After asking the adaptive sequence of queries φ₁, . . . , φ_T and receiving responses y₁, . . . , y_T, the analyst sets the test query to

    ψ := φ_{t⋆} where t⋆ := argmax_{t∈[T]} |yₜ − µφₜ(D)| / max(τ², τ · std_{φₜ}(D)).

It is sufficient to show that Equation (2) holds for t = t⋆ as, based on how we defined t⋆ , it then
holds for all t ∈ [T ].
In order to execute the mechanism in Figure 1, the analyst makes a total of Tk subsampling queries, one for each vote generated, each with w = 1 and a range |Y| = |{0, 1}| = 2. As a consequence of Theorem 10, with probability at least 1 − δ,

    error(φ_{t⋆}, S, D) ≤ O( ln(1/δ) · (Tk/n² + 1/n) ) = O( ln(1/δ) · τ²/ln(1/δ) ) = τ².

The above implies that

    |µφ_{t⋆}(S) − µφ_{t⋆}(D)| ≤ max( τ², τ√(Var_{φ_{t⋆}}(D)) ) = max( τ², τ · std_{φ_{t⋆}}(D) ).

We furthermore know, by Corollary 9.3, that with probability at least 1 − δ,

    |µφ_{t⋆}(S) − y_{t⋆}| ≤ max( τ², τ√( µφ_{t⋆}(S)(1 − µφ_{t⋆}(S)) ) ).

Note that it is important that Corollary 9.3 union bounds over all of t = 1, . . . , T, as it is therefore guaranteed to apply to t⋆. We could not have simply bounded the above for t⋆ by directly applying a Chernoff bound to it, as the selection process for t⋆ depends on the values of y₁, . . . , y_T and is more likely to choose a yₜ that deviates greatly from its mean.
Finally, union bounding over the two events, each of which occurs with probability at least 1 − δ, and applying Proposition 9.1, we have that with probability at least 1 − 2δ,

    |µφ_{t⋆}(D) − y_{t⋆}| ≤ 3 max( τ², τ · std_{φ_{t⋆}}(D) ).

The desired result follows by renaming δ′ := δ/2 and τ′ := τ/3.

9.2 The median-finding mechanism: Proof of Theorem 4
In this section, we prove Theorem 4. We begin with the following definition.
Definition 14 (Bad sample). For a query φ : X w → R ⊆ R, and threshold r ∈ R that is not an
approximate median of φ(D), we say a sample S ∈ X n is (φ, r)-bad if, with probability at least 0.45
over x1 , . . . , xw chosen uniformly without replacement from S,

1[φ(x1 , . . . , xw ) ≥ r] ̸= 1[median(µφ (D)) ≥ r].

Intuitively, the proof of Theorem 4 is separated into two pieces: We show that, with high probability, for all queries φ and thresholds r used in Figure 2, only a small fraction (≤ 0.02k) of the groups are "bad" samples. Then, we show that, as long as this is true, with high probability all answers output by the mechanism are approximate medians.

Proof of Theorem 4. For each t ∈ [T ] and threshold r ∈ Rt considered by the mechanism, let p(t, r)
be defined as follows: If r is an approximate median of µφt (D), we define p(t, r) = 0. If r is smaller
than all approximate medians of µφt (D), then we define p(t, r) to be the fraction of votes v1 , . . . , vk
that were set to 0 when determining whether yt ≥ r. If r is larger than all approximate medians of
µφt (D), then we define p(t, r) to be the fraction of such votes that were set to 1.
If, for all choices of t, r, it holds that p(t, r) < 0.5, then all yt output by the median mechanism
must be an approximate median. Let t⋆ , r⋆ be the choices maximizing p(t⋆ , r⋆ ). Then, it is
sufficient to prove, with probability at least 1 − δ, that p(t⋆ , r⋆ ) < 0.5. We define the test query
ψ : X wt⋆ → {0, 1},
ψ(x1 , . . . , xwt⋆ ) := 1[φt⋆ (x1 , . . . , xwt⋆ ) ≥ r⋆ ].
For each t ∈ [T ] and i ∈ [k], to generate the votes, we execute ln |Rt | subsampling queries to Si ,
each with domain X wt and range {0, 1}. Let n′ := ⌊n/k⌋ be the number of elements in each group.
Applying Theorem 9 and Markov’s inequality, with probability at least 0.99,
P !
t∈[T ] w t · ln |Rt | 1
error(ψ, Si , D)2 ≤ O + ′ .
(n′ )2 n

Based on how we set the parameters in Theorem 4, the right-hand side of the above equation is at most O(1/w_max). We are therefore able to conclude that, for each i ∈ [k], with probability at least 0.99,

    |µψ(Sᵢ) − µψ(D)| ≤ 0.01.
Therefore, for each group i ∈ [k], the probability that Sᵢ is (φ_{t⋆}, r⋆)-bad is at most 0.01. From here, we execute a similar proof strategy as that used in Lemma 8.2 to amplify the success probability. By Lemma 8.4, for any choice of groups G ⊆ [k], the probability that Sᵢ is (φ_{t⋆}, r⋆)-bad for every i ∈ G is at most 0.01^{|G|}. Therefore, applying the Chernoff bound of Fact 8.5, with probability at least 1 − exp(−k/300), at most 0.02k groups are (φ_{t⋆}, r⋆)-bad.
Next, we note that, for any single choice of φₜ and r ∈ Rₜ, if at most 0.02k groups are (φₜ, r)-bad, then the expected value of p(t, r) is at most 0.47 and, by a Chernoff bound, the probability that p(t, r) ≥ 0.5 is at most exp(−Ω(k)). We chose k large enough to union bound over all T · log₂(R_max) choices of t ∈ [T] and r ∈ Rₜ. Specifically, with probability at least 1 − δ, for each t ∈ [T] and r ∈ Rₜ for which at most 0.02k groups are bad, p(t, r) < 0.5. In particular, this includes p(t⋆, r⋆), proving the desired result.

10 Acknowledgments
The author thanks Li-Yang Tan and Jonathan Ullman for their helpful discussions and feedback.
He furthermore thanks the STOC reviewers for their helpful feedback.
Guy is supported by NSF awards 1942123, 2211237, and 2224246.

References
[AKW22] Jianhang Ai, Ondřej Kuželka, and Yuyi Wang. Hoeffding and Bernstein inequalities for U-statistics without replacement. Statistics & Probability Letters, 187:109528, 2022. 2.3

[BBG18] Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling:
Tight analyses via couplings and divergences. Advances in Neural Information Processing
Systems, 31, 2018. 1.3

[Bla23] Guy Blanc. Subsampling suffices for adaptive data analysis. In Proceedings of the 55th
Annual ACM Symposium on Theory of Computing, pages 999–1012, 2023. 3, A

[BMD+ 05] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, Joydeep Ghosh, and John
Lafferty. Clustering with bregman divergences. Journal of machine learning research,
6(10), 2005. 5.3

[BNS+ 16] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan
Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the forty-
eighth annual ACM symposium on Theory of Computing, pages 1046–1059, 2016. 1,
1.2.1, 1.3, 9.1

[CS72] Min-Te Chao and WE Strawderman. Negative moments of positive random variables.
Journal of the American Statistical Association, 67(338):429–431, 1972. 6

[DFH+ 15a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and
Aaron Roth. Generalization in adaptive data analysis and holdout reuse. Advances in
Neural Information Processing Systems, 28, 2015. 1, 2

[DFH+ 15b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and
Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis.
Science, 349(6248):636–638, 2015. 1, 2

[DFH+ 15c] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and
Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings
of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126,
2015. 1, 1.2.1

[DK22] Yuval Dagan and Gil Kur. A bounded-noise mechanism for differential privacy. In
Conference on Learning Theory, pages 625–661. PMLR, 2022. 1, 1.2.1

[FGR+ 17] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao.
Statistical algorithms and a lower bound for detecting planted cliques. Journal of the
ACM (JACM), 64(2):1–37, 2017. 1, 1.2.1

[FRR20] Benjamin Fish, Lev Reyzin, and Benjamin IP Rubinstein. Sampling without com-
promising accuracy in adaptive data analysis. In Algorithmic Learning Theory, pages
297–318. PMLR, 2020. 1, 1.3
[FS17] Vitaly Feldman and Thomas Steinke. Generalization for adaptively-chosen estimators
via stable median. In Conference on Learning Theory, pages 728–757. PMLR, 2017. 1,
1, 1.2.2
[FS18] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data
analysis. In Conference On Learning Theory, pages 535–544. PMLR, 2018. 1, 2, 2.2,
2.2, 2.4, 3, 5, 5.1, 7.1, A, 15
[FSG08] Béla A Frigyik, Santosh Srivastava, and Maya R Gupta. Functional bregman divergence
and bayesian estimation of distributions. IEEE Transactions on Information Theory,
54(11):5130–5139, 2008. 5.3
[Hoe48] Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The
Annals of Mathematical Statistics, 19(3):293–325, 1948. 2.3
[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58(301):13–30, 1963. 6
[HU14] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data
analysis is hard. In 2014 IEEE 55th Annual Symposium on Foundations of Computer
Science, pages 454–463. IEEE, 2014. 1, 1.2.1
[KB80] Rob Kaas and Jan M Buhrman. Mean, median and mode in binomial distributions.
Statistica Neerlandica, 34(1):13–18, 1980. 8.2
[KB13] Vladimir S Korolyuk and Yu V Borovskich. Theory of U-statistics, volume 273. Springer
Science & Business Media, 2013. 2.3
[Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the
ACM (JACM), 45(6):983–1006, 1998. 1, 1.2.1
[Neu66] Peter Neumann. Über den Median der Binomial- und Poissonverteilung. Wissenschaftliche Zeitschrift der Technischen Universität Dresden (in German), 19:29–33, 1966. 8.2
[RRST16] Ryan Rogers, Aaron Roth, Adam Smith, and Om Thakkar. Max-information, differential
privacy, and post-selection hypothesis testing. In 2016 IEEE 57th Annual Symposium
on Foundations of Computer Science (FOCS), pages 487–494. IEEE, 2016. 1
[RZ16] Daniel Russo and James Zou. Controlling bias in adaptive data analysis using informa-
tion theory. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the
19th International Conference on Artificial Intelligence and Statistics, volume 51 of
Proceedings of Machine Learning Research, pages 1232–1240, Cadiz, Spain, 09–11 May
2016. PMLR. 1, 2.4
[Sha04] Ronen Shaltiel. Towards proving strong direct product theorems. Computational
Complexity, 12(1/2):1–22, 2004. 2.5, 8.1

[Sie01] Alan Siegel. Median bounds and their application. Journal of Algorithms, 38(1):184–236,
2001. 8.2

[Smi17] Adam Smith. Information, privacy and stability in adaptive data analysis. arXiv
preprint arXiv:1706.00820, 2017. 1

[Ste22] Thomas Steinke. Composition of differential privacy & privacy amplification by sub-
sampling. arXiv preprint arXiv:2210.00597, 2022. 1.3

[SU15a] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential
privacy. arXiv preprint arXiv:1501.06095, 2015. 1, 1.2.1

[SU15b] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness
of preventing false discovery. In Conference on learning theory, pages 1588–1628. PMLR,
2015. 1, 1.2.1

[Ver08] Mark Veraar. A note on optimal probability lower bounds for centered random variables.
arXiv preprint arXiv:0803.0727, 2008. 8.7

[ZW19] Yuqing Zhu and Yu-Xiang Wang. Poisson subsampled Rényi differential privacy. In International Conference on Machine Learning, pages 7634–7642. PMLR, 2019. 1.3

A The average leave-one-out KL stability of subsampling queries


A key part of our mutual information bound is the new notion of stability we develop which
generalizes Feldman and Steinke’s notion of average leave-one-out KL stability [FS18].

Definition 15 (Average leave-one-out KL stability, [FS18]). A randomized algorithm M : X^n → Y
is ε-ALOOKL stable if there is a randomized algorithm M′ : X^{n−1} → Y such that, for all samples
S ∈ X^n,

    E_{i∼[n]}[ dKL( M(S) ∥ M′(S−i) ) ] ≤ ε.
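To make Definition 15 concrete, the following sketch numerically evaluates the ALOOKL stability of a simple binary-output mechanism: on a {0, 1}-valued sample, a subsampling query with subsample size 1 and the identity predicate outputs Ber(mean(S)), and a natural choice of M′ is the same query on the n − 1 remaining points. The mechanism and sample here are illustrative choices for this sketch, not constructions from the paper.

```python
import math

def dkl_bern(a, b):
    """KL divergence dKL(Ber(a) || Ber(b)), in nats."""
    def term(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

def alookl_of_mean_query(S):
    """E_{i~[n]} dKL(M(S) || M'(S_{-i})) for the mechanism M(S) = Ber(mean(S)),
    with M' the same mechanism on the n-1 remaining points.
    Assumes every leave-one-out mean stays strictly inside (0, 1)."""
    n, s = len(S), sum(S)
    p = s / n
    return sum(dkl_bern(p, (s - x) / (n - 1)) for x in S) / n
```

On the balanced 100-point sample `[0] * 50 + [1] * 50`, this evaluates to roughly 5 × 10^{-5}, on the order of 1/n^2, which is the stability scale the remark after Lemma A.2 says is needed to recover Theorem 2.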

Feldman and Steinke show how to use ALOOKL stability to bound mutual information.

Lemma A.1 (Using ALOOKL stability to bound mutual information). Let S be drawn from a
product distribution over X^n and, for each t ∈ [T], draw y_t ∼ M_{y_1,…,y_{t−1}}(S) where M_{y_1,…,y_{t−1}} is
ε-ALOOKL stable. Then,

    I(S; (y_1, . . . , y_T)) ≤ nT ε.

The conference version of this paper [Bla23] used ALOOKL stability to obtain a weaker bound
on the mutual information. A natural question is whether we needed to move from ALOOKL stability
to ALMOKL stability in order to obtain the improved mutual information bound of Theorem 2. Here,
we show that this is indeed the case, and that the conference version's bound is tight.

Lemma A.2. For any n ≥ 1000, there is a φ : X^1 → {0, 1} which, when applied as a subsampling
query to a sample of size n, is not ε-ALOOKL stable for any ε ≤ ln(n)/(2n^2).

To recover Theorem 2 (up to constants), we would need ε ≤ O(1/n^2).

Proof. Let X := {0, 1} and φ be the indicator function. We'll use the notation ⃗0^n to refer to the
tuple containing n-many 0s. Upon receiving the input ⃗0^{n−1}, let p be the probability that M′
outputs 1 (so that M′(⃗0^{n−1}) = Ber(p)). We consider two cases.
Case 1: p > ln(n)/(2n^2). In this case, consider S := ⃗0^n. We'll show that

    E_{i∼[n]}[ dKL(φ(S) ∥ M′(S−i)) ] ≥ p > ln(n)/(2n^2).

This is the easier case. No matter which element is removed, S−i = ⃗0^{n−1}, so M′(S−i) = Ber(p).
On the other hand, φ(S) = Ber(0). Therefore,

    E_{i∼[n]}[ dKL(φ(S) ∥ M′(S−i)) ] = dKL(Ber(0) ∥ Ber(p)) = −ln(1 − p) ≥ p.

Case 2: p ≤ ln(n)/(2n^2). In this case, consider the sample containing a single 1, S := ⃗0^{n−1} ◦ 1.
Note that φ(S) = Ber(1/n), and there is a 1/n chance that S−i = ⃗0^{n−1}, so

    E_{i∼[n]}[ dKL(φ(S) ∥ M′(S−i)) ] ≥ (1/n) · dKL(Ber(1/n) ∥ Ber(p)).

The above KL divergence is larger the further p is from 1/n. Since ln(n)/(2n^2) < 1/n, we can lower
bound

    E_{i∼[n]}[ dKL(φ(S) ∥ M′(S−i)) ] ≥ (1/n) · dKL(Ber(1/n) ∥ Ber(ln(n)/(2n^2)))
        = (1/n) · [ (1/n) · ln( (1/n) / (ln(n)/(2n^2)) ) + (1 − 1/n) · ln( (1 − 1/n) / (1 − ln(n)/(2n^2)) ) ]
        ≥ (1/n) · [ (1/n) · ln(2n/ln(n)) + 1 · ln(1 − 1/n) ],

where in the last inequality, we used that the second term is negative, so we are free to increase its
magnitude while maintaining a valid lower bound. Here we use that when n ≥ 1000, ln(1 − 1/n) ≥
−1.001 · (1/n) to bound

    E_{i∼[n]}[ dKL(φ(S) ∥ M′(S−i)) ] ≥ (1/n^2) · (ln(n) − ln ln(n) − 1.001),

which, using n ≥ 1000, is strictly more than ln(n)/(2n^2).
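The case analysis above can be sanity-checked numerically. The sketch below (an illustrative check, not part of the proof) takes n = 1000 and, for each candidate value of p = Pr[M′(⃗0^{n−1}) = 1] on a log-spaced grid, confirms that the larger of the two cases' lower bounds exceeds ln(n)/(2n^2).

```python
import math

def dkl_bern(a, b):
    """KL divergence dKL(Ber(a) || Ber(b)), in nats."""
    def term(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

def lemma_a2_lower_bound(p, n):
    """Larger of the two adversarial samples' expected-KL lower bounds."""
    case1 = -math.log(1 - p)           # S = 0^n:          dKL(Ber(0) || Ber(p))
    case2 = dkl_bern(1 / n, p) / n     # S = 0^{n-1} ∘ 1:  (1/n) dKL(Ber(1/n) || Ber(p))
    return max(case1, case2)

n = 1000
threshold = math.log(n) / (2 * n ** 2)
# Log-spaced grid over p in (0, 1) covering both cases of the proof.
grid = [10 ** (-8 + 7.9 * i / 500) for i in range(501)]
assert all(lemma_a2_lower_bound(p, n) >= threshold for p in grid)
```

Whichever case applies, the bound holds with room to spare: for p at the case boundary ln(n)/(2n^2), the Case 2 bound still evaluates to a constant factor above the threshold.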
