Nonuniform Negative Sampling and Log Odds Correction With Rare Events Data
Chong Wang
ByteDance Inc.
[email protected]
Abstract
We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about unknown parameters is tied only to the relatively small number of positive instances, which justifies the use of negative sampling. However, if the negative instances are subsampled to the same level as the positive cases, there is information loss. To maintain more information, we derive the asymptotic distribution of a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance. To further improve the estimation efficiency over the IPW method, we propose a likelihood-based estimator by correcting the log odds for the sampled data and prove that the improved estimator has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification. We validate our approach on simulated data as well as a real click-through rate dataset with more than 0.3 trillion instances, collected over a period of a month. Both theoretical and empirical results demonstrate the effectiveness of our method.
1 Introduction
Imbalanced data are ubiquitous in scientific fields and applications with binary response outputs, where the number of instances in the positive class is much smaller than that in the negative class. For example, in online search or recommendation systems where billions of impressions appear each day, non-click impressions usually dominate. Using these non-click impressions as negative instances is prevalent and has proven useful, especially in modern machine learning systems (McMahan et al., 2013; He et al., 2014; Chen et al., 2016; Guo et al., 2017; Huang et al., 2020). A common approach to address imbalanced data is to balance it by subsampling the negative class (Drummond et al., 2003; Liu et al., 2009) and/or up-sampling the positive class (Chawla et al., 2002; Han et al., 2005; Mathew et al., 2017; Douzas and Bacao, 2017), and a great deal of attention has been drawn to this problem; see Japkowicz (2000); King and Zeng (2001); Chawla et al. (2004); Estabrooks et al. (2004); Owen (2007); Sun et al. (2007); Chawla (2009); Rahman and Davis (2013); Fithian and Hastie (2014); Lemaître et al. (2017); Wang (2020), and the references therein.
In this paper, we focus on subsampling the negative class, i.e., negative sampling, while keeping all rare positive instances. This is beneficial for modern online learning systems, where negative sampling significantly reduces data volume in distributed storage while saving training time for multiple downstream models. Rare events data and negative sampling are studied in Wang (2020). Our main contributions are as follows.
• Under a general binary response model with rare events data, we find that the difference between the full data estimator and the true parameter converges to a normal distribution at a rate that is tied only to the number of rare positive instances. This indicates the possibility of throwing away the majority of negative instances without information loss, i.e., it justifies the use of negative sampling.
• We show that there is no asymptotic information loss under aggressive negative sampling if the negative instances kept still dominate the positive instances. However, there is information loss, expressed as a variance inflation, if the negative instances are brought down to the same level as the positive instances. For this case, we obtain optimal subsampling probabilities by minimizing the unconditional asymptotic variance of a general IPW estimator.
• We develop a likelihood estimator through nonuniform log odds correction for sampled data,
and prove that it has the smallest asymptotic variance among a large class of estimators.
• We apply the proposed method to a real online streaming click-through rate (CTR) dataset
with more than 0.3 trillion instances and demonstrate its effectiveness.
The rest of the paper is organized as follows. Section 2 defines the problem this paper focuses on and presents the full data results. Section 3 derives the optimal probability for the IPW estimator. Section 4 proposes the likelihood-based estimator and establishes its asymptotic optimality. Section 5 discusses practical implementation and the theoretical properties of the resulting estimators. Section 6 presents numerical experiments with simulated data and applies the proposed method to a real online streaming CTR dataset with more than 0.3 trillion instances. Section 7 concludes the paper and points out limitations of our investigation. Proofs and additional experiments with misspecifications are in the supplementary materials.
small. Heuristically, this implies that a change in the feature value does not turn a rare event into a large-probability event. We can also allow β∗ to change with N, but as long as β∗ has a finite limit, the situation is essentially the same as a fixed β∗, so we assume β∗ to be fixed to simplify the presentation.
Throughout this paper, we use ġ(x; θ) and g̈(x; θ) to denote the gradient and Hessian of g(x; θ) with respect to (w.r.t.) θ, respectively. Let ‖v‖ be the Frobenius norm and v^⊗2 = vv^T for a vector or a matrix v. We denote by π(x, y) the sampling probability for an instance (x, y); for negative cases, we sometimes use the shorthand π(x). We use o_P(1) to represent a random quantity that converges to 0 in probability as N → ∞.
We show that for rare events data, the estimation error rate for θ is related to N₁ instead of N. Using the full data, a commonly used estimator of θ is the maximum likelihood estimator (MLE)
θ̂_f := arg max_θ Σ_{i=1}^N [y_i g(x_i; θ) − log{1 + e^{g(x_i; θ)}}].
To investigate the theoretical properties of the estimator θ̂f , we make the following assumptions.
Assumption 1. The first, second, and third derivatives of f(x; β) and e^{f(x;β)} f(x; β) w.r.t. any components of β are bounded by a square integrable random variable B(x).
Assumption 2. The matrix E{ġ^⊗2(x; θ∗)} is finite and positive definite.
Assumption 1 imposes constraints on the smoothness of f(x; β) w.r.t. β and on the tightness of the distribution of x. Assumption 2 ensures that all components of θ are estimable.
Theorem 1. Let V_f = E{e^{f(x;β∗)}} M_f^{-1}, where M_f = E{e^{f(x;β∗)} ġ^⊗2(x; θ∗)}. Under Assumptions 1 and 2, as N → ∞,
√N₁ (θ̂_f − θ∗) −→ N(0, V_f), in distribution.
This result shows that even if all the N instances are used for model training, the resulting estimator θ̂_f converges to the true parameter θ∗ at the rate of N₁^{-1/2}, which is much slower than the usual rate of N^{-1/2} for regular cases. This indicates that with rare events data, the available information about unknown parameters is at the scale of N₁. Although N → ∞ much faster than N₁ → ∞ (in the sense that N₁/N → 0), N does not explicitly show up in the asymptotic normality result of Theorem 1. For the specific case of linear logistic regression with g(x; θ) = α + x^T β, our Theorem 1 reduces to Theorem 1 of Wang (2020).
Since the available information is tied to the number of positive instances instead of the full data size, one can keep all the positive instances and significantly subsample the negative instances to reduce the computational cost. For better estimation efficiency, we consider nonuniform sampling. Let ϕ(x) > 0 be an integrable function and ρ be the sampling rate on the negative class. Without loss of generality, assume E{ϕ(x)} = 1 so that π(x_i) := ρϕ(x_i) is the sampling probability for the i-th data point if y_i = 0. We present the nonuniform negative sampling procedure in Algorithm 1.
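As a concrete illustration, this sampling scheme (keep every positive instance; keep the i-th negative with probability π(x_i) = ρϕ(x_i)) can be sketched in a few lines of NumPy. The function and variable names below are our own, not from the paper:

```python
import numpy as np

def negative_sample(X, y, rho, phi, seed=0):
    """Keep all positives; keep negative i with probability rho * phi(x_i).

    Returns the subsample together with each kept instance's inclusion
    probability pi_i, which downstream estimators need (IPW weights or
    log odds offsets).
    """
    rng = np.random.default_rng(seed)
    pi = np.where(y == 1, 1.0, np.clip(rho * phi(X), 0.0, 1.0))
    keep = rng.random(len(y)) < pi
    return X[keep], y[keep], pi[keep]

# Toy imbalanced data: roughly 1% positives, uniform phi(x) = 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 3))
y = (rng.random(100_000) < 0.01).astype(int)
Xs, ys, pis = negative_sample(X, y, rho=0.02, phi=lambda X: np.ones(len(X)))
```

Every positive instance survives the subsampling, while the negative class shrinks by roughly a factor of ρ.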
3 Weighted estimation and its optimal negative sampling probability
Let δ_i = 1 if the i-th data point is selected and δ_i = 0 otherwise. Given a subsample taken according to π(x_i, y_i) = y_i + (1 − y_i)ρϕ(x_i), i = 1, ..., N, the IPW estimator of θ is
θ̂_w = arg max_θ Σ_{i=1}^N δ_i π^{-1}(x_i, y_i) [y_i g(x_i; θ) − log{1 + e^{g(x_i; θ)}}].   (2)
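For a linear g(x; θ) = α + x^T β, the IPW objective in (2) can be maximized with a generic optimizer. The sketch below is our own illustration using SciPy's BFGS, not code from the paper; np.logaddexp(0, g) computes log(1 + e^g) stably:

```python
import numpy as np
from scipy.optimize import minimize

def ipw_logistic(Xs, ys, pis):
    """IPW estimator (2): maximize sum_i pi_i^{-1} [y_i g_i - log(1 + e^{g_i})]."""
    Z = np.hstack([np.ones((len(ys), 1)), Xs])  # intercept column; theta = (alpha, beta)
    w = 1.0 / pis

    def neg_loglik(theta):
        g = Z @ theta
        return -np.sum(w * (ys * g - np.logaddexp(0.0, g)))

    def grad(theta):                            # -Z^T [w * (y - sigmoid(g))]
        g = Z @ theta
        return -Z.T @ (w * (ys - 1.0 / (1.0 + np.exp(-g))))

    return minimize(neg_loglik, np.zeros(Z.shape[1]), jac=grad, method="BFGS").x

# Simulate rare-events logistic data, subsample negatives uniformly, estimate.
rng = np.random.default_rng(0)
theta_true = np.array([-3.0, 1.0, -1.0])
X = rng.normal(size=(50_000, 2))
p = 1.0 / (1.0 + np.exp(-(theta_true[0] + X @ theta_true[1:])))
y = (rng.random(50_000) < p).astype(int)
pi = np.where(y == 1, 1.0, 0.2)                 # rho = 0.2, phi = 1
keep = rng.random(50_000) < pi
theta_hat = ipw_logistic(X[keep], y[keep], pi[keep])
```

The weights 1/π_i undo the sampling bias, so θ̂_w is consistent for the full-data parameter.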
We need an assumption on the subsampling rate to investigate the subsample estimators.
Assumption 3. The subsampling rate ρ satisfies c_N := e^{α∗}/ρ → c for a constant c such that 0 ≤ c < ∞.
Note that c_N cannot be zero but its limit c can be exactly zero. It can be shown that c_N E{e^{f(x;β∗)}} = N₁(N₀ρ)^{-1}{1 + o_P(1)}, so cE{e^{f(x;β∗)}} is the asymptotic positive/negative ratio for the sample.
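To see why cE{e^{f(x;β∗)}} is the asymptotic positive/negative ratio of the subsample, a heuristic sketch consistent with the definitions above: since Pr(y = 1 | x) ≈ e^{α∗+f(x;β∗)} when α∗ → −∞, N₀ ≈ N, and the expected number of sampled negatives is N₀ρ (using E{ϕ(x)} = 1),

```latex
\frac{N_1}{N_0 \rho}
  \approx \frac{N\,\mathrm{E}\{e^{\alpha^* + f(x;\beta^*)}\}}{N\rho}
  = \frac{e^{\alpha^*}}{\rho}\,\mathrm{E}\{e^{f(x;\beta^*)}\}
  = c_N\,\mathrm{E}\{e^{f(x;\beta^*)}\}
  \to c\,\mathrm{E}\{e^{f(x;\beta^*)}\}.
```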
The theorem below shows the asymptotic distribution of θ̂w .
Theorem 2. Under Assumptions 1-3, if E[{ϕ(x) + ϕ^{-1}(x)}B²(x)] < ∞, then as N → ∞,
√N₁ (θ̂_w − θ∗) −→ N(0, V_w), in distribution,
where V_w = V_f + V_sub and V_sub = cE{e^{f(x;β∗)}} M_f^{-1} E{ϕ^{-1}(x) e^{2f(x;β∗)} ġ^⊗2(x; θ∗)} M_f^{-1}.
The subsampled estimator θ̂_w has the same convergence rate as the full data estimator. However, the asymptotic variance V_w may be inflated by the term V_sub from subsampling the negative cases. Here V_sub is zero if c = 0. If we keep many more negative instances than positive instances in the subsample (c = 0), then there is no asymptotic information loss due to subsampling (V_w = V_f). In this scenario, the number of negative instances can still be aggressively reduced (ρ → 0) so that the computational efficiency is significantly improved, and the sampling function ϕ(x) does not play a significant role. If we have to reduce the negative instances to the same level as the positive instances (0 < c < ∞), then the variance inflation V_sub is not negligible. In this scenario, a well designed sampling function ϕ(x) is more relevant and critical.
Theorem 2 shows that the distribution of the estimation error err := √N₁ (θ̂_w − θ∗) is approximated by a normal distribution N(0, V_w). Thus for any ε > 0, the probability of excess error Pr(‖err‖ > ε) is approximated by Pr{‖N(0, V_w)‖ > ε} = Pr(Σ_{j=1}^d λ_j z_j² > ε²), where the λ_j's are eigenvalues of V_w and the z_j's are independent standard normal random variables. Therefore, a smaller trace of V_w means a smaller probability of excess error at the same level, since tr(V_w) = Σ_{j=1}^d λ_j.
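The eigenvalue identity used here, ‖N(0, V)‖² = Σ_j λ_j z_j², is easy to check numerically. The following self-contained snippet (our own illustration, with an arbitrary 2×2 covariance standing in for V_w) compares the two ways of computing the excess-error probability by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # stand-in for V_w
lam = np.linalg.eigvalsh(V)           # eigenvalues lambda_j
eps = 2.0

# Pr(||N(0, V)|| > eps) by direct simulation.
draws = rng.multivariate_normal(np.zeros(2), V, size=200_000)
p_direct = np.mean(np.linalg.norm(draws, axis=1) > eps)

# Same probability via Pr(sum_j lambda_j z_j^2 > eps^2) with z_j iid N(0, 1).
z = rng.standard_normal((200_000, 2))
p_eig = np.mean(z**2 @ lam > eps**2)
```

The two Monte Carlo estimates agree up to simulation noise, and the eigenvalues sum to tr(V), which is why minimizing the trace shrinks the excess-error probability.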
The following theorem gives the optimal negative sampling function for the IPW estimator.
Theorem 3. For a given sampling rate ρ, the asymptotically optimal ϕ(x) that minimizes tr(V_w) is
ϕ_os(x) = min{t(x; θ∗), T} / E[min{t(x; θ∗), T}],   (3)
where t(x; θ) = p(x; θ)‖M_f^{-1} ġ(x; θ)‖ and T is the largest number such that ρ min{t(x; θ∗), T} ≤ E[min{t(x; θ∗), T}] with probability one.
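An empirical version of (3) on a finite sample is straightforward: score each instance by t_i, then normalize, bisecting on the truncation point T only when ρ max_i t_i exceeds the mean score (per Remark 1 below, the untruncated branch is the common case for very imbalanced data). This sketch is our own implementation, not code from the paper:

```python
import numpy as np

def phi_os(t, rho, iters=200):
    """Empirical (3): phi_i = min(t_i, T) / mean(min(t, T)), with T the largest
    truncation level keeping every probability rho * phi_i at most 1."""
    t = np.asarray(t, dtype=float)
    if rho * t.max() <= t.mean():       # no truncation needed (T = infinity)
        return t / t.mean()
    lo, hi = 0.0, t.max()               # feasible / infeasible truncation levels
    for _ in range(iters):              # bisect on: rho * T <= mean(min(t, T))
        T = 0.5 * (lo + hi)
        if rho * T <= np.minimum(t, T).mean():
            lo = T
        else:
            hi = T
    m = np.minimum(t, lo)
    return m / m.mean()

t = np.abs(np.random.default_rng(1).standard_normal(10_000)) ** 3  # skewed scores
phi = phi_os(t, rho=0.5)
```

Bisection works here because the feasibility gap mean(min(t, T)) − ρT is concave in T and zero at T = 0, so the feasible truncation levels form an interval [0, T∗].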
Remark 1. If ρ → 0, then ρt(x; θ∗) ≤ E{t(x; θ∗)} almost surely, so T can be dropped (i.e., T = ∞). If lim_{N→∞} ρ > 0, then c = 0, so the variance inflation due to subsampling is negligible. Thus, the truncation term T can be ignored in practice with imbalanced data; it only plays a role for not-so-imbalanced data when the sampling ratio is not very small.
The term p(x; θ∗) in t(x; θ∗) comes from the binary response model structure, and it gives higher preference to a data point with larger p(x; θ∗) in the negative class. The term ‖M_f^{-1} ġ(x; θ∗)‖ corresponds to the A-optimality criterion (Pukelsheim, 2006). For computational benefit in optimal sampling, A-optimality is often replaced by L-optimality (e.g., Wang et al., 2018; Ting and Brochu, 2018). Under L-optimality, ‖M_f^{-1} ġ(x; θ∗)‖ is replaced by ‖ġ(x; θ∗)‖, and this minimizes the asymptotic mean squared error (MSE) of M_f θ̂_w. In case the gradient is difficult to obtain, one could drop the gradient term and simply replace t(x; θ∗) by p(x; θ∗). Although this does not give an optimal sampling probability, it outperforms uniform subsampling because it takes into account the information of the model structure, like local case-control (LCC) sampling.
4 More efficient estimation based on the likelihood with log odds correction
4.1 Nonuniform log odds correction
The optimal function ϕ_os(x) assigns larger probabilities to more informative instances. However, the IPW estimator in (2) assigns smaller weights to more informative instances in estimation, so the resulting estimator can be improved to achieve higher estimation efficiency. A naive unweighted estimator is biased, and the bias correction approach in Fithian and Hastie (2014); Wang (2019) does not work for the sampling procedure in Algorithm 1 even when the underlying model is logistic regression. We adopt the idea of Han et al. (2020) and seek a more efficient estimator through the corrected likelihood of the sampled data based on any negative sampling probability π(x). For data included in the subsample (where δ = 1), the conditional probability of y = 1 is
p_π(x; θ) := Pr(y = 1 | x, δ = 1) = 1 / (1 + e^{−g(x;θ)−l}),   (4)
where l = −log{π(x)}. Please see the detailed derivation of (4) in the supplement. To avoid the IPW, we propose the estimator based on the log odds corrected likelihood in (4), namely,
θ̂_lik = arg max_θ ℓ_lik(θ), where ℓ_lik(θ) = Σ_{i=1}^N δ_i [y_i g(x_i; θ) − log{1 + e^{g(x_i; θ)+l_i}}].   (5)
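Although the full derivation of (4) is in the supplement, it follows from Bayes' theorem in a few lines. A sketch, writing p(x; θ) = e^{g(x;θ)}/(1 + e^{g(x;θ)}) and using Pr(δ = 1 | y = 1, x) = 1 and Pr(δ = 1 | y = 0, x) = π(x):

```latex
\Pr(y=1 \mid x, \delta=1)
 = \frac{\Pr(\delta=1 \mid y=1, x)\,p(x;\theta)}
        {\Pr(\delta=1 \mid y=1, x)\,p(x;\theta) + \Pr(\delta=1 \mid y=0, x)\,\{1-p(x;\theta)\}}
 = \frac{p(x;\theta)}{p(x;\theta) + \pi(x)\{1 - p(x;\theta)\}}
 = \frac{e^{g(x;\theta)}}{e^{g(x;\theta)} + \pi(x)}
 = \frac{1}{1 + e^{-g(x;\theta) + \log \pi(x)}}
 = \frac{1}{1 + e^{-g(x;\theta) - l}},
```

with l = −log{π(x)}, matching (4).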
For θ̂_w in (2), π(x_i) enters through its inverse. If an instance with a very small π(x_i) is selected into the sample, it dominates the objective function, making the resulting estimator unstable. For θ̂_lik, this problem is ameliorated because π(x_i) enters through a logarithm in the log-likelihood of (5), and log(v) goes to infinity much more slowly than v^{-1} as v ↓ 0.
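Concretely, for linear g the estimator (5) is an ordinary logistic regression with a per-instance offset l_i = −log π(x_i) inside the log-partition term, and no inverse-probability weights. A minimal SciPy sketch (our own code, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize

def lik_logistic(Xs, ys, pis):
    """Log odds corrected estimator (5): offset l_i = -log pi_i, no IPW weights."""
    Z = np.hstack([np.ones((len(ys), 1)), Xs])
    l = -np.log(pis)                    # l = 0 for positives (pi = 1)

    def neg_loglik(theta):
        g = Z @ theta
        return -np.sum(ys * g - np.logaddexp(0.0, g + l))

    def grad(theta):                    # -Z^T [y - sigmoid(g + l)]
        g = Z @ theta
        return -Z.T @ (ys - 1.0 / (1.0 + np.exp(-(g + l))))

    return minimize(neg_loglik, np.zeros(Z.shape[1]), jac=grad, method="BFGS").x

# Simulate rare-events logistic data, subsample negatives uniformly, estimate.
rng = np.random.default_rng(0)
theta_true = np.array([-3.0, 1.0, -1.0])
X = rng.normal(size=(50_000, 2))
p = 1.0 / (1.0 + np.exp(-(theta_true[0] + X @ theta_true[1:])))
y = (rng.random(50_000) < p).astype(int)
pi = np.where(y == 1, 1.0, 0.2)
keep = rng.random(50_000) < pi
theta_hat = lik_logistic(X[keep], y[keep], pi[keep])
```

Because π(x_i) appears only inside the logarithm, extreme sampling probabilities perturb the objective far less than the 1/π(x_i) weights in (2).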
Theorem 4 shows that the proposed estimator θ̂_lik is asymptotically normal with variance V_lik. We compare V_lik with V_w to show that θ̂_lik has higher estimation efficiency than θ̂_w.
Theorem 5. If Vw and Vlik are finite and positive definite, then Vlik ≤ Vw , i.e., Vw − Vlik is
non-negative definite. The equality holds when c = 0 and in this case Vlik = Vw = Vf .
Since the estimator θ̂_lik in Eq. (5) is based on the conditional log-likelihood of the sampled data, it in fact has the highest estimation efficiency among a class of asymptotically unbiased estimators.
Theorem 6. Let X = (x₁, ..., x_N)^T be the feature matrix and D_δ denote the sampled data. Let θ̂_U be a subsample estimator with the following asymptotic representation:
θ̂_U = U(θ∗; D_δ) + N₁^{-1/2} o_P(1),   (6)
where the variance V{U(θ∗; D_δ)} exists, and U(θ∗; D_δ) satisfies E{U(θ∗; D_δ) | X} = θ∗ and E{U̇(θ∗; D_δ) | X} = 0 with U̇(θ∗; D_δ) = ∂U(θ∗; D_δ)/∂θ∗^T. If N₁ V{U(θ∗; D_δ)} → V_U in probability, then V_U ≥ V_lik.
Remark 2. Theorem 6 tells us that θ̂lik is statistically the most efficient among a class of estimators,
which includes both θ̂w and θ̂lik as special cases. The condition E{U(θ ∗ ; Dδ ) | X} = θ ∗ ensures that
the estimator is asymptotically unbiased. The condition E{U̇(θ ∗ ; Dδ ) | X} = 0 can be intuitively
interpreted as the requirement that the derivative of the oP (1) term in (6) is also negligible. Clearly,
this is satisfied by all unbiased estimators for which the oP (1) term in (6) is 0.
5 Practical considerations
In this section, we first show how we design a more practical estimator based on the previous results
and then present the theoretical analysis behind these improvements.
Like the LCC sampling (Fithian and Hastie, 2014) and the A-optimal sampling (Wang, 2019), the
ϕos (x) in (3) depends on θ, and thus a pilot value of θ, say θ̃, is required in practice. Here, θ̃ can
be constructed from a pilot sample, say (x̃i , ỹi ), i = 1, ..., ñ, taken by uniform sampling from both
classes with equal expected sample sizes. With θ̃ obtained, calculate t̃i = p(x̃i ; θ̃)kM̃−1 f ġ(x̃i ; θ̃)k
for i = 1, ..., ñ, where M̃f is the Hessian matrix of the pilot sample objective function. In case
that the Hessian matrix and gradients are hard to record, the gradient term can be dropped, i.e.,
use t̃i = p(x̃; θ̃). As mentioned in Remark 1, T can be ignored for very imbalanced data. The
expectation in the denominator of (3) can be approximated by mimicking the method of moment
estimator, namely by
ω̃ = 2N1 (ñN )−1 ỹi =1 t̃i + 2N0 (ñN )−1 ỹi =0 t̃i .
P P
(7)
In practice, it is common to set a lower threshold, say %, on sampling probabilities, to ensure that all
instances have positive probabilities to be selected. This is more critical for the IPW estimator while
the likelihood estimator is not very sensitive to very small sampling probabilities.
Denote the pilot value as ϑ̃ = (θ̃, ω̃). The following probabilities can be practically implemented:
π_%^os(x; ϑ̃) = min[max{ρϕ̃_os(x), %}, 1], where ϕ̃_os(x) = ω̃^{-1} t(x; θ̃),   (8)
and we use this in our experiments in Section 6. The ω̃ in (7) is essentially a normalizing term so that the expected sampling rate is ρ. For some practical systems such as online models, ω̃ is often treated as a tuning parameter. We will illustrate this in our real data example in Section 6.2.
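Putting (7) and (8) together, the pilot-based pipeline is a handful of lines. The sketch below is our own illustration (names hypothetical), assuming a pilot sample drawn with equal expected sizes from each class and pilot scores t̃_i already computed:

```python
import numpy as np

def omega_tilde(t_pilot, y_pilot, N1, N0):
    """Moment approximation (7) of the denominator E[t(x)] in (3)."""
    n, N = len(y_pilot), N1 + N0
    return (2 * N1 / (n * N)) * t_pilot[y_pilot == 1].sum() \
         + (2 * N0 / (n * N)) * t_pilot[y_pilot == 0].sum()

def sampling_prob(t, omega, rho, floor):
    """Truncated probabilities (8): min[max{rho * t / omega, floor}, 1]."""
    return np.clip(rho * t / omega, floor, 1.0)

# Hypothetical pilot: 100 instances per class, full data N1 = 1e3, N0 = 4e5.
rng = np.random.default_rng(3)
y_pilot = np.repeat([1, 0], 100)
t_pilot = np.where(y_pilot == 1,
                   rng.uniform(0.5, 1.0, 200),
                   rng.uniform(0.0, 0.1, 200))
omega = omega_tilde(t_pilot, y_pilot, N1=1_000, N0=400_000)
pi_neg = sampling_prob(t_pilot[y_pilot == 0], omega, rho=0.01, floor=1e-6)
```

The 2N₁/(ñN) and 2N₀/(ñN) factors undo the equal-size class-balanced pilot draw, so ω̃ is (in expectation) the population mean of t(x).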
6 Numerical experiments
We present simulated results and report the performance on a CTR dataset with more than 0.3 trillion
instances. More experiments with model and pilot misspecifications are available in the supplement.
We implement the following negative sampling methods for logistic regression: 1) uniW: uniform sampling with the IPW estimator; 2) uniLik: uniform sampling with the likelihood estimator; 3) optW: optimal sampling with the IPW estimator; 4) optLik: optimal sampling with the likelihood estimator. We also implement the full data MLE (Full) and LCC for comparison.
In each repetition of the simulation, we generate full data of size N = 5 × 105 from the logistic
regression with g(x; θ) = α + xT β. We set the true β ∗ to be a 6 × 1 vector of ones and set different
values for α for different feature distributions so that the positive/negative ratio is close to 1:400. We
consider the following feature distributions: (a) Normal distribution which is symmetric with light
tails; the true intercept is α = −7.65. (b) Log-normal distribution which is asymmetric and positively
skewed; the true intercept is α = −0.5. (c) t3 distribution which is symmetric with heavier tails; the
true intercept is α = −7.
We set the sampling rate to ρ = 0.002, 0.004, 0.006, 0.01, and 0.02 for all sampling methods. In each repetition, uniform samples of average size 100 are selected from each class to calculate the pilot estimates, so the uncertainty due to the pilot estimates is taken into account. We also consider pilot misspecification by adding a uniform random number from U(0, 1.5) to the pilot so that the pilot is systematically different from the true parameter. We repeat the simulation R = 1000 times to calculate the MSE as R^{-1} Σ_{r=1}^R ‖θ̂^{(r)} − θ∗‖², where θ̂^{(r)} is the estimate in the r-th repetition.
Figure 1: Log(MSEs) for different estimators (the smaller the better); panels: (a) x's are normal, (b) x's are lognormal, (c) x's are t3. The top row uses consistent pilot estimates; the bottom row uses misspecified pilot estimates.
Results are presented in Figure 1. The MSEs of all estimators decrease as the sampling rate ρ increases. The optLik outperforms the other sampling methods in general, and its performance is close to the full data MLE when ρ = 0.02, especially if the pilot is consistent. The advantage of optLik over optW is more evident for smaller ρ. If the pilot is misspecified, optLik is quite robust, but optW and LCC significantly deteriorate. Since uniLik and uniW do not require a pilot, they are not affected by pilot misspecification, but they have much larger MSE than optLik and optW when the pilot is consistent, especially uniW. Different feature distributions also affect the performances of the different sampling methods.
To verify Theorem 1 numerically, we let the full data size N increase at a faster rate than the average number of positive instances N₁a = E(N₁), so that the probability Pr(y = 1) decreases towards zero. As N increases, we fix the value of β∗ and set decreasing values of α∗ to mimic our scaling regime. We consider two covariate distributions: 1) a multivariate normal distribution, for which the logistic regression model is correct; and 2) a multivariate log-normal distribution, for which the logistic regression model is misspecified. We repeat the simulation 100 times to calculate the empirical variance V̂_f of the full data MLE, and report the results in Table 1. It is seen that tr(V̂_f) decreases towards zero. According to Theorem 1, tr(V̂_f) should converge at a rate of 1/N₁, so N₁a tr(V̂_f) should be relatively stable as N increases. This is indeed the case, as seen in the third and sixth columns of Table 1. On the other hand, N tr(V̂_f) grows dramatically as N increases. These observations confirm the theoretical result in Theorem 1. Furthermore, Table 1 suggests that the convergence rate in Theorem 1 may also hold for some misspecified models.
Table 1: Empirical variances of the full data MLE for different combinations of full data size and average number of positive instances.

              |        Correct model           |       Misspecified model
(N, N1a)      | tr(V̂f)  N1a tr(V̂f)  N tr(V̂f)  | tr(V̂f)  N1a tr(V̂f)  N tr(V̂f)
(10^3, 32)    | 0.169      5.41      169.17    | 0.969     30.99       968.70
(10^4, 64)    | 0.097      6.20      969.29    | 0.322     20.59      3217.12
(10^5, 128)   | 0.045      5.76     4497.24    | 0.135     17.32     13527.60
(10^6, 256)   | 0.018      4.62    18048.40    | 0.046     11.74     45847.40
We conduct experiments on an internal CTR dataset from our products with more than 0.3 trillion instances. There are 299 input features, including user and ad profiles together with rich context features. We concatenate all features into a single high-dimensional vector x, whose length is greater than 5,000. We use a 3-layer fully connected deep neural network as a feature extractor that maps x into a 256-dimensional dense feature vector h(x). The downstream model is a linear logistic regression model taking h(x) as input to predict the binary output of whether the user clicks the ad or not. The positive/negative ratio is around 1:80. Due to limited storage, we first uniformly drop 80% of the negative instances and try the various negative sampling algorithms on the remaining 60 billion instances. On this pre-processed dataset, we then split out 1 percent of the data for testing and do negative sampling on the rest. The entire training procedure scans the subsampled dataset from beginning to end in one pass according to the timestamp. Testing is done all the way through training. We calculate the area under the ROC curve (AUC) for the testing instances. The entire model is trained on a distributed machine learning platform using 1,200 CPU cores and can finish within two days.
We adopt the sampling probability π_%^os(x; ϑ̃) defined in (8) and use the empirical sampling rate to calibrate the normalizing term ω̃. As mentioned before, we remove the gradient term for implementation tractability, so t(x; θ̃) = p(x; θ̃). As the results will show, this approximate score produces results consistent with the simulation experiments. We compare four methods: 1) uniW and 2) uniLik are the same as in the simulated experiments; and 3) optW and 4) optLik represent nonuniform sampling using π_%^os(x; ϑ̃) with the IPW estimator and with the likelihood estimator, respectively. We use "opt" to refer to both optW and optLik when focusing on the sampling.
We study two sets of negative sampling rates (w.r.t. the original 0.3 trillion instances): (i) [0.01, 0.05], where negative instances still outnumber positive instances; and (ii) [0.001, 0.008], where negative subsampling is so aggressive that positive instances dominate negative instances. We demonstrate the results in Figure 2. When the sampling rate is moderate (Figure 2(b)), opt is nearly optimal and significantly outperforms uniform subsampling, and there is still a small gap between the IPW estimator and the likelihood estimator. When the sampling rate is extremely small (Figure 2(a)), opt becomes sub-optimal and its gap over uniform sampling narrows; still, we see a clear difference between the IPW estimator and the likelihood estimator. This is due to the larger asymptotic variance of uniform sampling when the sampling rate goes to zero. Note that on this huge dataset, a small relative AUC gain (e.g., 0.0005) can usually bring a significant revenue gain in products.
In practice, we find it necessary to use cross-validation to find the optimal % in (8) given a fixed sample
rate. As demonstrated in Figure 3, for each sampling rate we tune % from a very small value (in this
(a). Extremely small sampling rates. (b) Moderate sampling rates.
Figure 2: Empirical testing AUC of subsample estimators for different sample sizes (the larger the
better).
case, 10^{-6}) all the way to its largest value. When % reaches its largest value, π_%^os(x; ϑ̃) reduces to uniform negative sampling. We observe that π_%^os(x; ϑ̃) with a moderate % achieves the best results across various sampling rates. Note that % is the lower truncation level for negative sampling probabilities. Thus, a larger % means that more "simple" negative instances (those with ρω̃^{-1} p(x_i; θ̃) smaller than %) have a better chance of being selected. This coincides with recent empirical results for negative subsampling in large search systems (Huang et al., 2020), where they mix "simple" (smaller p(x_i; θ̃)) and "hard" (larger p(x_i; θ̃)) negative instances. We also observe an empirical variance of around 0.0001 for each setup, demonstrating that the relative improvement is consistent.
[Figure 3: ten panels sharing the legend uniW, uniLik, optW, optLik; panel titles Sample Rate = 0.05, 0.04, 0.03, 0.02, 0.01 (top row) and 0.008, 0.006, 0.004, 0.002, 0.001 (bottom row); x-axis: truncation lower bound %; y-axis: AUC.]
Figure 3: AUC for IPW and likelihood estimators when tuning the truncation lower bound %.
future studies. 2) Oversampling the rare positive instances is another common practice to deal with imbalanced data; its application together with negative sampling is not considered. 3) The assumption of a fixed β∗ means that both the marginal and conditional probabilities for a positive instance to occur are small. If the proportion of positive instances for a given feature value in a dataset is large, this assumption may not be appropriate, and further investigation is needed to see if our results still apply. In the following, we present three scenarios that the scaling regime may or may not fit:
Scenario 1 (phone call while driving) Car crashes occur with small probabilities, and making phone
calls while driving significantly increases the probability of a car crash. However, the probability
of car crashes among people making phone calls while driving is still small, so for these types of
features, our scaling regime is appropriate to model the rare events.
Scenario 2 (anti-causal learning) Anti-causal learning (Schölkopf et al., 2012; Kilbertus et al., 2018) assumes that the label (y) causes the observations (x). Thus Pr(x | y) represents the causal mechanism and is fixed. One standard example is that diseases (y) cause symptoms (x). Our scaling regime fits the framework of anti-causal learning. To see this, using Bayes' theorem we write the log odds as
α + f(x; β) = log{Pr(y = 1)/Pr(y = 0)} + log{Pr(x | y = 1)/Pr(x | y = 0)}.
In anti-causal learning, the marginal distribution of y changes while the conditional distribution of x given y is fixed. Thus only the scale factor α changes, and f(x; β) is fixed.
Scenario 3 (local transmission of COVID-19) Our scaling regime has limitations and does not apply to all types of rare events data. For example, although the COVID-19 infection rate for the whole population is low, the infection rate for people whose family members in the same house have positive test results is high. This means a change in a family member's test result converts a small-probability event into a large-probability event, and our scaling regime would not be appropriate.
Proofs and Additional Numerical Experiments
Before presenting the proofs, we point out that Assumption 2 implies that E{v(x)ġ^⊗2(x; θ)} is positive definite for any function v(x) > 0 almost surely. This is because Assumption 2 means that Pr{l^T ġ(x; θ) ≠ 0} > 0 for any l ≠ 0, and therefore Pr{v^{1/2}(x) l^T ġ(x; θ) ≠ 0} > 0 for any l ≠ 0, implying that E{v(x)ġ^⊗2(x; θ)} is positive definite. We also point out that if a sequence is o_P(1) conditionally, then it is also o_P(1) unconditionally, and vice versa (Xiong and Li, 2008; Cheng et al., 2010).
To facilitate the presentation, define a_N = (N e^{α∗})^{1/2}. We notice that
N₁ = N E[e^{α∗+f(x;β∗)} / {1 + e^{α∗+f(x;β∗)}}] {1 + o_P(1)} = a_N² E{e^{f(x;β∗)}}{1 + o_P(1)},   (S.1)
from the dominated convergence theorem. Thus the normalizing term √N₁ for the asymptotic normality in the paper can be replaced by a_N E^{1/2}{e^{f(x;β∗)}}.
so a_N(θ̂ − θ∗) is the maximizer of
γ(u) = ℓ(θ∗ + a_N^{-1} u) − ℓ(θ∗).
By Taylor's expansion,
γ(u) = a_N^{-1} u^T ℓ̇(θ∗) − 0.5 a_N^{-2} Σ_{i=1}^N φ(x_i; θ∗){u^T ġ(x_i; θ∗)}² + Δ_ℓ + R,
where
ℓ̇(θ) = ∂ℓ(θ)/∂θ = Σ_{i=1}^N {y_i − p(x_i; θ)} ġ(x_i; θ),
Δ_ℓ = (2a_N²)^{-1} Σ_{i=1}^N {y_i − p(x_i; θ∗)} u^T g̈(x_i; θ∗) u,
and R is the remainder term. By direct calculation,
R = (6a_N³)^{-1} Σ_{i=1}^N φ(x_i; θ∗ + a_N^{-1}ú){1 − 2p(x_i; θ∗ + a_N^{-1}ú)}{u^T ġ(x_i; θ∗ + a_N^{-1}ú)}³
  + (6a_N³)^{-1} Σ_{i=1}^N φ(x_i; θ∗ + a_N^{-1}ú){u^T ġ(x_i; θ∗ + a_N^{-1}ú)}{u^T g̈(x_i; θ∗ + a_N^{-1}ú)u}
  + (6a_N³)^{-1} Σ_{i=1}^N {y_i − p(x_i; θ∗ + a_N^{-1}ú)} Σ_{j1,j2=1}^d u_{j1} u_{j2} u^T g⃛_{j1j2}(x_i; θ∗ + a_N^{-1}ú),
where g⃛_{j1j2}(x; θ) is the gradient of g̈_{j1j2}(x; θ) and ú lies between 0 and u. We see that
|R| ≤ (6a_N³)^{-1} ‖u‖³ Σ_{i=1}^N p(x_i; θ∗ + a_N^{-1}ú) C(x_i; θ∗ + a_N^{-1}ú) + (6a_N³)^{-1} d‖u‖³ Σ_{i=1}^N y_i B(x_i)
  ≤ (6N a_N)^{-1} ‖u‖³ e^{a_N^{-1}‖u‖} Σ_{i=1}^N exp{f(x_i; β∗ + a_N^{-1}ú_{(−1)})} C(x_i; θ∗ + a_N^{-1}ú) + (6a_N³)^{-1} d‖u‖³ Σ_{i=1}^N y_i B(x_i)
  ≤ (6N a_N)^{-1} d³‖u‖³ e^{a_N^{-1}‖u‖} Σ_{i=1}^N B(x_i) + (6a_N³)^{-1} d‖u‖³ Σ_{i=1}^N y_i B(x_i) ≡ Δ_R1 + Δ_R2,   (S.2)
where C(x; θ) = ‖ġ(x; θ)‖³ + ‖ġ(x; θ)‖‖g̈(x; θ)‖ + Σ_{j1,j2} ‖g⃛_{j1j2}(x; θ)‖ and ú_{(−1)} is ú with the first element removed. From Assumption 1, E(Δ_R1) → 0 and
E(Δ_R2) ≤ (6N a_N)^{-1} d‖u‖³ Σ_{i=1}^N E{e^{f(x_i;β∗)} B(x_i)} = (6a_N)^{-1} d‖u‖³ E{e^{f(x;β∗)} B(x)} → 0.
Since Δ_R1 and Δ_R2 are both positive, Markov's inequality shows that they are both o_P(1), so R = o_P(1). For Δ_ℓ, the mean is E(Δ_ℓ) = 0 and the variance satisfies
V(Δ_ℓ) ≤ (4a_N⁴)^{-1} ‖u‖⁴ Σ_{i=1}^N E{p(x_i; θ∗)‖g̈(x_i; θ∗)‖²} ≤ (4N a_N²)^{-1} ‖u‖⁴ Σ_{i=1}^N E[e^{f(x_i;β∗)}‖g̈(x_i; θ∗)‖²] = (4a_N²)^{-1} ‖u‖⁴ E[e^{f(x;β∗)}‖g̈(x; θ∗)‖²] → 0,
so Δ_ℓ = o_P(1).
If we can show that
a_N^{-1} ℓ̇(θ∗) −→ N(0, M_f),   (S.3)
in distribution, and
a_N^{-2} Σ_{i=1}^N φ(x_i; θ∗) ġ^⊗2(x_i; θ∗) −→ M_f,   (S.4)
in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that a_N(θ̂ − θ∗), the maximizer of γ(u), satisfies
a_N(θ̂ − θ∗) = M_f^{-1} × a_N^{-1} ℓ̇(θ∗) + o_P(1).   (S.5)
Slutsky's theorem together with (S.3) and (S.5) implies the result in Theorem 1. We prove (S.3) and (S.4) in the following.
Note that
ℓ̇(θ∗) = Σ_{i=1}^N {y_i − p(x_i; θ∗)} ġ(x_i; θ∗)
is a summation of i.i.d. quantities. Since the distribution of {y − p(x; θ∗)}ġ(x; θ∗) depends on N because α∗ → −∞ as N → ∞, we need to use the Lindeberg-Feller central limit theorem for triangular arrays (see Section 2.8 of van der Vaart, 1998). We examine the mean and variance of a_N^{-1} ℓ̇(θ∗). For the mean, we have
E[{y_i − p(x_i; θ∗)}ġ(x_i; θ∗)] = E[E{y_i − p(x_i; θ∗) | x_i} ġ(x_i; θ∗)] = 0,
and for the variance, the dominated convergence theorem gives
\[
V\{a_N^{-1}\dot\ell(\theta^*)\} = e^{-\alpha^*}E\big[p(x;\theta^*)\{1-p(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big] \to E\{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\} = M_f.
\]
For the Lindeberg-Feller condition, for any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\big[\|\{y_i - p(x_i;\theta^*)\}\dot g(x_i;\theta^*)\|^2 I(\|\{y_i - p(x_i;\theta^*)\}\dot g(x_i;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= N E\big[\|\{y - p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\{y - p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= N E\big[p(x;\theta^*)\{1-p(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^2 I(\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]

where
\[
M_w = E\left[\left\{1 + \frac{c\,e^{f(x;\beta^*)}}{\varphi(x)}\right\}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\right].
\]
We then point out that under Assumptions 1 and 2, the condition $E[\{\varphi(x) + \varphi^{-1}(x)\}B^2(x)] < \infty$ implies the following:
\[
E\big[\varphi(x)e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2\big] < \infty, \tag{S.6}
\]
\[
E\big[\{1 + \varphi^{-1}(x)\}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\big] < \infty, \tag{S.7}
\]
\[
E\big[\varphi^{-1}(x)e^{2f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\big] < \infty. \tag{S.8}
\]
\[
\gamma_w(u) = \frac{1}{a_N}u^{\mathrm{T}}\dot\ell_w(\theta^*) - \frac{1}{2a_N^2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*)\{u^{\mathrm{T}}\dot g(x_i;\theta^*)\}^2 + \Delta_{\ell_w} + R_w,
\]
where
\[
\dot\ell_w(\theta) = \frac{\partial\ell_w(\theta)}{\partial\theta} = \sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\{y_i - p(x_i;\theta)\}\dot g(x_i;\theta),
\]
\[
\Delta_{\ell_w} = \frac{1}{2a_N^2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\{y_i - p(x_i;\theta^*)\}u^{\mathrm{T}}\ddot g(x_i;\theta^*)u,
\]
and $R_w$ is the remainder. Similarly to the proof of Theorem 1, we only need to show that
\[
a_N^{-1}\dot\ell_w(\theta^*) \longrightarrow N(0, M_w) \tag{S.9}
\]
in distribution, and
\[
a_N^{-2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \longrightarrow M_f \tag{S.10}
\]
in probability.
Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998). Denote $\delta = y + (1-y)I\{u \le \rho\varphi(x)\}$, where $u \sim U(0,1)$. For any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\{\|\eta_i\|^2 I(\|\eta_i\| > \varepsilon a_N)\}
= N E\big[\|\pi^{-1}(x,y)\delta\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\pi^{-1}(x,y)\delta\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= \rho N E\big[\varphi(x)\|\pi^{-1}(x,y)\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\pi^{-1}(x,y)\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-\rho\varphi(x)\}\|\pi^{-1}(x,y)y\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\pi^{-1}(x,y)y\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= \rho N E\big[\varphi(x)p(x;\theta^*)\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-p(x;\theta^*)\}\rho^{-1}\varphi^{-1}(x)\|p(x;\theta^*)\dot g(x;\theta^*)\|^2 I(\rho^{-1}\varphi^{-1}(x)p(x;\theta^*)\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-\rho\varphi(x)\}p(x;\theta^*)\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\le \rho N e^{\alpha^*}E\big[\varphi(x)e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N e^{2\alpha^*}\rho^{-1}E\big[\varphi^{-1}(x)\|e^{f(x;\beta^*)}\dot g(x;\theta^*)\|^2 I(\rho^{-1}\varphi^{-1}(x)p(x;\theta^*)\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N e^{\alpha^*}E\big[e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
= o(N e^{\alpha^*}),
\]
where the last step is due to Assumption 1, (S.6)-(S.8), the dominated convergence theorem, and the facts that $a_N \to \infty$ and $\lim_{N\to\infty} e^{\alpha^*}/\rho = c < \infty$. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (S.9).
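As a concrete illustration of the scheme analyzed above, the indicator $\delta = y + (1-y)I\{u \le \rho\varphi(x)\}$ and the associated inverse probability weights $\delta/\pi(x,y)$ can be sketched in a few lines. This is a minimal NumPy sketch with a uniform placeholder $\varphi$, not the paper's production pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def negative_sample(y, phi, rho):
    """Keep every positive; keep negative i with probability rho * phi[i].

    Returns the inclusion indicator delta = y + (1 - y) * I{u <= rho * phi}
    and the IPW weights delta / pi(x, y), where pi(x, 1) = 1 and
    pi(x, 0) = rho * phi(x).
    """
    u = rng.uniform(size=y.shape)
    pi = np.where(y == 1, 1.0, rho * phi)
    delta = np.where(y == 1, 1, (u <= rho * phi).astype(int))
    return delta, delta / pi

# toy imbalanced data: roughly 1% positives, uniform phi as a placeholder
y = (rng.uniform(size=100_000) < 0.01).astype(int)
phi = np.ones(y.shape)
delta, w = negative_sample(y, phi, rho=0.05)

assert delta[y == 1].all()          # all positives are retained
# the weighted negative count is approximately the full negative count
print((y == 0).sum(), w[y == 0].sum())
```

The weights make the subsampled score function unbiased for the full-data one, which is exactly what the mean-zero property of $\eta_i$ above encodes.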
Now we prove (S.10). Let
\[
H_w \equiv a_N^{-2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*)
= \frac{1}{N}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\,\frac{e^{f(x_i;\beta^*)}}{(1+e^{\alpha^*+f(x_i;\beta^*)})^2}\dot g^{\otimes 2}(x_i;\theta^*)
\]
\[
= \frac{1}{N}\sum_{i=1}^N \frac{y_i+(1-y_i)I\{u_i \le \rho\varphi(x_i)\}}{y_i+(1-y_i)\rho\varphi(x_i)}\,\frac{e^{f(x_i;\beta^*)}}{(1+e^{\alpha^*+f(x_i;\beta^*)})^2}\dot g^{\otimes 2}(x_i;\theta^*).
\]
We notice that
\[
E(H_w) = E\left[\frac{e^{f(x;\beta^*)}}{(1+e^{\alpha^*+f(x;\beta^*)})^2}\dot g^{\otimes 2}(x;\theta^*)\right] = E\{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\} + o(1), \tag{S.11}
\]
where the last step is by the dominated convergence theorem. In addition, the variance of each component of $H_w$ is bounded by
\[
\frac{1}{N}E\left\{\frac{\delta}{\pi^2(x,y)}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\right\}
\le \frac{1}{N}E\left\{\frac{1}{\pi(x,y)}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\right\}
\le \frac{1}{N\rho}E\left[\left\{\rho+\frac{1}{\varphi(x)}\right\}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\right] = o(1), \tag{S.12}
\]
where the last step is because of (S.7) and the facts that $Ne^{\alpha^*} \to \infty$ and $e^{\alpha^*}/\rho \to c < \infty$ imply that $N\rho \to \infty$. From (S.11) and (S.12), Chebyshev's inequality implies that $H_w \to E\{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\}$ in probability.
In the following we finish the proof by showing that $\Delta_{\ell_w} = o_P(1)$ and $R_w = o_P(1)$. For $\Delta_{\ell_w}$, the mean $E(\Delta_{\ell_w}) = 0$ and the variance satisfies
\[
V(\Delta_{\ell_w}) \le \frac{\|u\|^4 N}{4a_N^4}E\left[\frac{\{y-p(x;\theta^*)\}^2\|\ddot g(x;\theta^*)\|^2}{y+(1-y)\rho\varphi(x)}\right]
\le \frac{\|u\|^4 N}{4a_N^4}E\left[\left\{1+\frac{p(x;\theta^*)}{\rho\varphi(x)}\right\}p(x;\theta^*)\|\ddot g(x;\theta^*)\|^2\right]
\]
\[
\le \frac{\|u\|^4}{4a_N^2}E\left[\left\{1+\frac{e^{\alpha^*}e^{f(x;\beta^*)}}{\rho\varphi(x)}\right\}e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\right] = o(1),
\]
so $\Delta_{\ell_w} = o_P(1)$. For the remainder term $R_w$, by direct calculation,
\[
R_w = \frac{1}{6a_N^3}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*+a_N^{-1}\acute{u})\{1-2p(x_i;\theta^*+a_N^{-1}\acute{u})\}\{u^{\mathrm{T}}\dot g(x_i;\theta^*+a_N^{-1}\acute{u})\}^3
\]
\[
\quad + \frac{1}{6a_N^3}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*+a_N^{-1}\acute{u})\{u^{\mathrm{T}}\dot g(x_i;\theta^*+a_N^{-1}\acute{u})\}\{u^{\mathrm{T}}\ddot g(x_i;\theta^*+a_N^{-1}\acute{u})u\}
\]
\[
\quad + \frac{1}{6a_N^3}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\{y_i-p(x_i;\theta^*+a_N^{-1}\acute{u})\}\sum_{j_1,j_2=1}^d u_{j_1}u_{j_2}\,\dddot g_{j_1j_2}^{\mathrm{T}}(x_i;\theta^*+a_N^{-1}\acute{u})u.
\]
which is non-random. We notice that
\[
E\left\{\frac{\zeta^2 t^2(x;\theta^*)}{\varphi(x)}\right\}
= E\left\{\frac{\zeta^2[\min\{t(x;\theta^*),T\}]^2}{\varphi(x)}\right\} + E\left[\frac{\zeta^2\{t^2(x;\theta^*)-T^2\}}{\varphi(x)}I\{t(x;\theta^*)>T\}\right]
\]
\[
= E\left[\left\{\sqrt{\varphi(x)} - \frac{\zeta\min\{t(x;\theta^*),T\}}{\sqrt{\varphi(x)}}\right\}^2\right] - E\{\varphi(x)\} + 2E\big[\zeta\min\{t(x;\theta^*),T\}\big]
+ E\left[\frac{\zeta^2\{t^2(x;\theta^*)-T^2\}}{\varphi(x)}I\{t(x;\theta^*)>T\}\right]
\]
\[
= E\left[\left\{\sqrt{\varphi(x)} - \frac{\zeta\min\{t(x;\theta^*),T\}}{\sqrt{\varphi(x)}}\right\}^2\right] + E\left[\frac{\zeta^2\{t^2(x;\theta^*)-T^2\}}{\varphi(x)}I\{t(x;\theta^*)>T\}\right] + 1
\equiv E_1 + E_2 + 1,
\]
where $I(\cdot)$ is the indicator function. Note that the $E_i$'s are non-negative and $E_1 = 0$ if and only if
\[
\varphi(x) = \zeta\min\{t(x;\theta^*),T\} = \frac{\min\{t(x;\theta^*),T\}}{E[\min\{t(x;\theta^*),T\}]},
\]
which is attainable because this choice of $\varphi(x)$ satisfies the constraint in (S.13). If $t(x;\theta^*) \le T$ almost surely, then $E_2 = 0$ and the proof finishes. If $\Pr\{t(x;\theta^*) > T\} > 0$, then
\[
\frac{\rho T}{E[\min\{t(x;\theta^*),T\}]} = 1,
\]
because if this were not true, then we could find a larger $T$ that satisfies (S.13). This means that if $t(x;\theta^*) > T$, then
\[
\varphi(x) = \frac{\min\{t(x;\theta^*),T\}}{E[\min\{t(x;\theta^*),T\}]} = \frac{T}{E[\min\{t(x;\theta^*),T\}]} = \rho^{-1},
\]
which also minimizes $E_2$, since $\varphi(x)$ appears in the denominator of $E_2$ and $\rho^{-1}$ is the largest admissible value of $\varphi(x)$. This finishes the proof.
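The truncated optimal sampling function $\varphi(x) = \min\{t(x;\theta^*),T\}/E[\min\{t(x;\theta^*),T\}]$ from the argument above can be approximated on a sample of criterion values $t$. This is a sketch under the assumption that the $t$ values are available (e.g. from a pilot estimate); the grid search over sorted sample values is an illustration, not the paper's exact algorithm:

```python
import numpy as np

def truncated_optimal_phi(t, rho):
    """phi = min(t, T) / mean(min(t, T)) with the largest T such that
    rho * T / mean(min(t, T)) <= 1, which guarantees rho * phi <= 1."""
    # rho * T / mean(min(t, T)) is nondecreasing in T, so scan the sorted
    # sample values from the largest down to the first feasible threshold.
    T = np.inf
    for cand in np.sort(t)[::-1]:
        if rho * cand / np.minimum(t, cand).mean() <= 1:
            T = cand
            break
    m = np.minimum(t, T)
    return m / m.mean()

t = np.random.default_rng(1).lognormal(size=10_000)
rho = 0.1
phi = truncated_optimal_phi(t, rho)
assert (rho * phi <= 1 + 1e-9).all()   # valid sampling probabilities
assert abs(phi.mean() - 1.0) < 1e-9    # normalization E[phi] = 1
```

Truncation at $T$ is exactly what caps the sampling probability $\rho\varphi(x)$ at one while keeping $\varphi$ proportional to $\min\{t, T\}$.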
\[
\Pr(y=1 \mid x, \delta=1)
= \frac{\Pr(y=1, \delta=1 \mid x)}{\Pr(\delta=1 \mid x)}
= \frac{\Pr(y=1, \delta=1 \mid x)}{\Pr(y=1, \delta=1 \mid x) + \Pr(y=0, \delta=1 \mid x)}
\]
\[
= \frac{\Pr(y=1 \mid x)\Pr(\delta=1 \mid x, y=1)}{\Pr(y=1 \mid x)\Pr(\delta=1 \mid x, y=1) + \Pr(y=0 \mid x)\Pr(\delta=1 \mid x, y=0)}
\]
\[
= \frac{p(x;\theta^*)\pi(x,1)}{p(x;\theta^*)\pi(x,1) + \{1-p(x;\theta^*)\}\pi(x,0)}
= \frac{e^{g(x;\theta^*)}}{e^{g(x;\theta^*)} + \pi(x,0)}
= \frac{e^{g(x;\theta^*)+l}}{1 + e^{g(x;\theta^*)+l}},
\]
where $l = -\log\{\pi(x,0)\}$.
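In implementation terms, the derivation above says that the subsampled data follow a logistic model whose linear predictor is shifted by the offset $l = -\log \pi(x,0)$; a minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def corrected_probability(g, pi0):
    """Log odds correction: Pr(y = 1 | x, delta = 1) on the sampled data.

    g   : g(x; theta), the log odds under the full-data model
    pi0 : pi(x, 0) = rho * phi(x), the sampling probability for negatives
    Positives are kept with probability 1, so the sampled-data log odds
    equal g + l with offset l = -log(pi0).
    """
    z = g - np.log(pi0)
    return 1.0 / (1.0 + np.exp(-z))

# sanity check against the ratio form p*pi1 / (p*pi1 + (1 - p)*pi0)
g = np.array([-3.0, -1.0, 0.5])
pi0 = np.array([0.02, 0.1, 0.5])
p = 1 / (1 + np.exp(-g))
direct = p / (p + (1 - p) * pi0)
assert np.allclose(corrected_probability(g, pi0), direct)
```

The offset form is convenient in practice because it lets standard logistic regression software fit the sampled data unchanged, with $l$ supplied per instance.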
S.5 Proof of Theorem 4
The estimator $\hat\theta_{lik}$ is the maximizer of $\ell_{lik}(\theta)$, so $u_N = a_N(\hat\theta_{lik} - \theta^*)$ is the maximizer of
\[
\gamma_{lik}(u) = \ell_{lik}(\theta^* + a_N^{-1}u) - \ell_{lik}(\theta^*).
\]
By Taylor's expansion,
\[
\gamma_{lik}(u) = a_N^{-1}u^{\mathrm{T}}\dot\ell_{lik}(\theta^*) - \frac{1}{2}a_N^{-2}\sum_{i=1}^N \delta_i\phi_\pi(x_i;\theta^*)\{u^{\mathrm{T}}\dot g(x_i;\theta^*)\}^2 + \Delta_{\ell_{lik}} + R_{lik},
\]
\[
\le e^{\alpha^*}\rho^{-1}E\{e^{f(x;\beta^*)}\varphi^{-1}(x)B(x)\} + E\{B(x)\}.
\]
Therefore, $E(\Delta_{R_{lik}1}) \to 0$. Since $\Delta_{R_{lik}1}$ is positive, Markov's inequality shows that $\Delta_{R_{lik}1} = o_P(1)$ and thus $R_{lik} = o_P(1)$.
For $\Delta_{\ell_{lik}}$, the mean $E(\Delta_{\ell_{lik}}) = 0$ and the variance satisfies
\[
V(\Delta_{\ell_{lik}}) \le \frac{\|u\|^4}{4a_N^4}\sum_{i=1}^N E\{\delta_i p_\pi(x_i;\theta^*)\|\ddot g(x_i;\theta^*)\|^2\} \le \frac{\|u\|^4}{4a_N^2}E[e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2] \to 0,
\]
where the last step is because Assumption 1 implies that $e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2$ is integrable, so $\Delta_{\ell_{lik}} = o_P(1)$.
If we can show that
\[
a_N^{-1}\dot\ell_{lik}(\theta^*) \longrightarrow N(0, \Lambda_{lik}) \tag{S.15}
\]
in distribution, and
\[
a_N^{-2}\sum_{i=1}^N \delta_i\phi_\pi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \longrightarrow \Lambda_{lik} \tag{S.16}
\]
in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that $a_N(\hat\theta_{lik} - \theta^*)$, the maximizer of $\gamma_{lik}(u)$, satisfies
\[
a_N(\hat\theta_{lik} - \theta^*) = \Lambda_{lik}^{-1} \times a_N^{-1}\dot\ell_{lik}(\theta^*) + o_P(1). \tag{S.17}
\]
Slutsky's theorem together with (S.15) and (S.17) implies the result in Theorem 4. We prove (S.15) and (S.16) in the following.
Note that $\dot\ell_{lik}(\theta^*)$ is a summation of i.i.d. quantities $\delta_i\{y_i - p_\pi(x_i;\theta^*)\}\dot g(x_i;\theta^*)$, whose distribution depends on $N$. Note that
\[
E[\delta\{y - p_\pi(x;\theta^*)\}\dot g(x;\theta^*)]
= E[\pi(x,y)\{y - p_\pi(x;\theta^*)\}\dot g(x;\theta^*)]
= E\big[\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\} - \{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi(x;\theta^*)\big]\dot g(x;\theta^*)\big] = 0, \tag{S.18}
\]
where the last step is by a direct calculation from (S.14), which yields the equality
\[
p(x;\theta^*)\{1-p_\pi(x;\theta^*)\} = \{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi(x;\theta^*). \tag{S.19}
\]
\[
V\{a_N^{-1}\dot\ell_{lik}(\theta^*)\}
= a_N^{-2}N\,E\big[\delta\{y - p_\pi(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big]
= e^{-\alpha^*}E\big[\{y+(1-y)\rho\varphi(x)\}\{y - p_\pi(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= e^{-\alpha^*}E\big[\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2 + \{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi^2(x;\theta^*)\big]\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.20}
\]
\[
= e^{-\alpha^*}E\big[\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2 + p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}p_\pi(x;\theta^*)\big]\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.21}
\]
\[
= e^{-\alpha^*}E\big[\big[\{1-p_\pi(x;\theta^*)\}^2 + \{1-p_\pi(x;\theta^*)\}p_\pi(x;\theta^*)\big]p(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.22}
\]
\[
= e^{-\alpha^*}E\big[\big[\{1-p_\pi(x;\theta^*)\}^2 + \phi_\pi(x;\theta^*)\big]p(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.23}
\]
\[
= E\big[\big[\{1-p_\pi(x;\theta^*)\}^2 + \phi_\pi(x;\theta^*)\big]\{1-p(x;\theta^*)\}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\big], \tag{S.24}
\]
where the fourth equality uses (S.19). Note that the elements of the term inside the expectation in (S.24) are all bounded by the integrable random variable $e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2$. Thus, from the dominated convergence theorem and the fact that
\[
\big[\{1-p_\pi(x;\theta^*)\}^2 + \phi_\pi(x;\theta^*)\big]\{1-p(x;\theta^*)\}
= 1 - p_\pi(x;\theta^*) + o_P(1) = \frac{1}{1 + c\varphi^{-1}(x)e^{f(x;\beta^*)}} + o_P(1),
\]
we have
\[
V\{a_N^{-1}\dot\ell_{lik}(\theta^*)\} \longrightarrow E\left[\frac{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)}{1 + c\varphi^{-1}(x)e^{f(x;\beta^*)}}\right] = \Lambda_{lik}. \tag{S.25}
\]
To verify the Lindeberg-Feller condition, note that for any $\varepsilon > 0$,
\[
N E\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^2 I(\{1-p_\pi(x;\theta^*)\}\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-p(x;\theta^*)\}\rho\varphi(x)\{p_\pi(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^2 I(\rho\varphi(x)p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\le N E\big[e^{\alpha^*+f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
+ N E\big[e^{\alpha^*+f(x;\beta^*)}p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\|^2 I(e^{\alpha^*+f(x;\beta^*)}\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\le a_N^2 E\big[e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
+ a_N^2 E\big[e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(e^{\alpha^*+f(x;\beta^*)}\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
= o(a_N^2),
\]
where the first inequality uses the following fact derived from (S.14):
\[
\rho\varphi(x)p_\pi(x;\theta^*) = e^{\alpha^*+f(x;\beta^*)}\{1-p_\pi(x;\theta^*)\} \le e^{\alpha^*+f(x;\beta^*)}, \tag{S.26}
\]
and the last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998), we finish the proof of (S.15).
Now we prove (S.16). Denote
\[
H_{lik} = a_N^{-2}\sum_{i=1}^N \delta_i\phi_\pi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*).
\]
We notice that
\[
E(H_{lik}) = N a_N^{-2}E\big[\{y+(1-y)\rho\varphi(x)\}\phi_\pi(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= e^{-\alpha^*}E\big[\{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi(x;\theta^*)\{1-p_\pi(x;\theta^*)\}\dot g^{\otimes 2}(x;\theta^*)\big]
+ e^{-\alpha^*}E\big[p(x;\theta^*)p_\pi(x;\theta^*)\{1-p_\pi(x;\theta^*)\}\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= e^{-\alpha^*}E\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big]
+ e^{-\alpha^*}E\big[\{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi^2(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= V\{a_N^{-1}\dot\ell_{lik}(\theta^*)\} = \Lambda_{lik} + o(1), \tag{S.27}
\]
where the third equality is obtained by applying (S.19) and the fourth equality is from (S.20). In addition, the variance of each component of $H_{lik}$ is bounded by
\[
a_N^{-4}N\,E\{\delta\phi_\pi^2(x;\theta^*)\|\dot g(x;\theta^*)\|^4\}
\le a_N^{-4}N\,E\big[\{y+(1-y)\rho\varphi(x)\}p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\|^4\big]
\]
\[
\le a_N^{-4}N\,E\big[\{p(x;\theta^*)+\rho\varphi(x)\}p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\|^4\big]
\le 2a_N^{-4}N e^{\alpha^*}E\{e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\} = o(1), \tag{S.28}
\]
where the last inequality is because of (S.26), and the last step is because Assumption 1 implies that $e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4$ is integrable. From (S.27) and (S.28), Chebyshev's inequality implies that $H_{lik} \to \Lambda_{lik}$ in probability.
S.6 Proof of Theorem 5
Let $\kappa(x) = 1 + c\varphi^{-1}(x)e^{f(x;\beta^*)}$. We first notice that
\[
[E\{e^{f(x;\beta^*)}\}]^{-1}V_w = M_f^{-1}(M_f + c\Lambda_{sub})M_f^{-1} = M_f^{-1}E\big[\kappa(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\big]M_f^{-1},
\]
and
\[
[E\{e^{f(x;\beta^*)}\}]^{-1}V_{lik} = \Lambda_{lik}^{-1} = \big(E\big[\kappa^{-1}(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\big]\big)^{-1}.
\]
We only need to show that
\[
[E\{e^{f(x;\beta^*)}\}]^{-1}V_w \ge \Lambda_{lik}^{-1}.
\]
This is proved by the following calculation:
\[
0 \le E\Big[\big[\{\kappa^{1/2}(x)M_f^{-1} - \kappa^{-1/2}(x)\Lambda_{lik}^{-1}\}e^{f(x;\beta^*)/2}\dot g(x;\theta^*)\big]^{\otimes 2}\Big]
\]
\[
= E\big\{M_f^{-1}\kappa(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)M_f^{-1}\big\} + E\big\{\Lambda_{lik}^{-1}\kappa^{-1}(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\Lambda_{lik}^{-1}\big\}
- 2E\big\{M_f^{-1}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\Lambda_{lik}^{-1}\big\}
\]
\[
= [E\{e^{f(x;\beta^*)}\}]^{-1}V_w + \Lambda_{lik}^{-1} - 2\Lambda_{lik}^{-1} = [E\{e^{f(x;\beta^*)}\}]^{-1}V_w - \Lambda_{lik}^{-1}.
\]
where $dy = dy_1 \times dy_2 \times \cdots \times dy_N$ is a product measure and each $dy_i$ is the counting measure. This gives us
\[
E\{U(\theta;\mathcal{D}_\delta)\dot\ell_{lik}^{\mathrm{T}}(\theta) \mid X\} = I. \tag{S.30}
\]
Hence we have
\[
0 \le V\big\{U(\theta;\mathcal{D}_\delta) - a_N^{-2}\Lambda_{lik}^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta) \mid X\} + V\big\{a_N^{-2}\Lambda_{lik}^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\} - 2a_N^{-2}E\{U(\theta;\mathcal{D}_\delta)\dot\ell_{lik}^{\mathrm{T}}(\theta) \mid X\}\Lambda_{lik}^{-1}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta) \mid X\} + a_N^{-2}\Lambda_{lik}^{-1}V\big\{a_N^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\}\Lambda_{lik}^{-1} - 2a_N^{-2}\Lambda_{lik}^{-1}.
\]
Taking the expectation of the above conditional variance and using (S.25), we have
\[
E[V\{U(\theta;\mathcal{D}_\delta) \mid X\}] + a_N^{-2}\Lambda_{lik}^{-1}E\big[V\big\{a_N^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\}\big]\Lambda_{lik}^{-1} - 2a_N^{-2}\Lambda_{lik}^{-1}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta)\} + a_N^{-2}\Lambda_{lik}^{-1}\{1+o_P(1)\} - 2a_N^{-2}\Lambda_{lik}^{-1}
= V\{U(\theta;\mathcal{D}_\delta)\} - a_N^{-2}\Lambda_{lik}^{-1}\{1+o_P(1)\}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta)\} - N_1^{-1}V_{lik}\{1+o_P(1)\} \ge 0,
\]
where the last equality is because (S.1) implies that $a_N^{-2} = N_1^{-1}E\{e^{f(x;\beta^*)}\}\{1+o_P(1)\}$. This finishes the proof.
We prove (S.31) first. Let $\eta_i^{\tilde\vartheta} = \frac{\delta_i^{\tilde\vartheta}}{\pi_\varrho^{os}(x_i,y_i;\tilde\vartheta)}\{y_i - p(x_i;\theta^*)\}\dot g(x_i;\theta^*)$. Given $\tilde\vartheta$, the $\eta_i^{\tilde\vartheta}$, $i = 1, \ldots, N$, are i.i.d. with mean zero, and the variance satisfies
\[
V(\eta_i^{\tilde\vartheta} \mid \tilde\vartheta)
= E\left[\frac{\{y_i - p(x_i;\theta^*)\}^2}{y_i + (1-y_i)\pi_\varrho^{os}(x_i;\tilde\vartheta)}\dot g^{\otimes 2}(x_i;\theta^*) \,\Big|\, \tilde\vartheta\right]
\]
\[
= E\left[p(x_i;\theta^*)\{1-p(x_i;\theta^*)\}^2\dot g^{\otimes 2}(x_i;\theta^*) + \{1-p(x_i;\theta^*)\}\frac{p^2(x_i;\theta^*)}{\pi_\varrho^{os}(x_i;\tilde\vartheta)}\dot g^{\otimes 2}(x_i;\theta^*) \,\Big|\, \tilde\vartheta\right]
\]
\[
\le E\big[p(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) + \varrho^{-1}p^2(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \,\big|\, \tilde\vartheta\big]
\le e^{\alpha^*}E\big[e^{f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*) + (\varrho^{-1}e^{\alpha^*})e^{2f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*)\big].
\]
Since $\rho^{-1}\varrho \to c_l > 0$, for large enough $N$,
\[
V\{a_N^{-1}\dot\ell_w^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\} \le E\big[e^{f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*) + (1 + c_l^{-1}c)e^{2f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*)\big].
\]
Thus, we know that
\[
V\{a_N^{-1}\dot\ell_w^{\tilde\vartheta}(\theta^*)\} \longrightarrow \Lambda_w^{plt}.
\]
Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998) given $\tilde\vartheta$. For any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\big\{\|\eta_i^{\tilde\vartheta}\|^2 I(\|\eta_i^{\tilde\vartheta}\| > \varepsilon a_N) \,\big|\, \tilde\vartheta\big\} = o(a_N^2),
\]
by arguments similar to those used to verify the Lindeberg-Feller condition in the proof of Theorem 2.
For $\Delta_{\ell_w}^{\tilde\vartheta}$, the conditional mean $E(\Delta_{\ell_w}^{\tilde\vartheta} \mid \tilde\vartheta) = 0$ and the conditional variance satisfies
\[
V(\Delta_{\ell_w}^{\tilde\vartheta} \mid \tilde\vartheta) \le \frac{\|u\|^4 N}{4a_N^4}E\big[\{1+\varrho^{-1}p(x;\theta^*)\}p(x;\theta^*)\|\ddot g(x;\theta^*)\|^2\big]
\le \frac{\|u\|^4}{4a_N^2}E\big[\{1+\varrho^{-1}e^{\alpha^*}e^{f(x;\beta^*)}\}e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\big] = o(1),
\]
so $\Delta_{\ell_w}^{\tilde\vartheta} = o_P(1)$.
For the remainder term $R_w^{\tilde\vartheta}$, by direct calculations similar to those used in the proof of Theorem 2, we know that
\[
|R_w^{\tilde\vartheta}| \le \frac{d^3\|u\|^3 e^{a_N^{-1}\|u\|}}{6Na_N}\sum_{i=1}^N \frac{\delta_i^{\tilde\vartheta}}{\pi_\varrho^{os}(x_i,y_i;\tilde\vartheta)}B(x_i) + \frac{d\|u\|^3}{6a_N^3}\sum_{i=1}^N \frac{\delta_i^{\tilde\vartheta}}{\pi_\varrho^{os}(x_i,y_i;\tilde\vartheta)}y_iB(x_i)
\equiv \Delta_{R_w1}^{\tilde\vartheta} + \Delta_{R_w2}^{\tilde\vartheta}.
\]
By Taylor's expansion,
\[
\gamma_{lik}^{\tilde\vartheta}(u) = a_N^{-1}u^{\mathrm{T}}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) - \frac{1}{2}a_N^{-2}\sum_{i=1}^N \delta_i^{\tilde\vartheta}\phi_\pi^{\tilde\vartheta}(x_i;\theta^*)\{u^{\mathrm{T}}\dot g(x_i;\theta^*)\}^2 + \Delta_{\ell_{lik}}^{\tilde\vartheta} + R_{lik}^{\tilde\vartheta}, \tag{S.35}
\]
where $\phi_\pi^{\tilde\vartheta}(x;\theta) = p_\pi^{\tilde\vartheta}(x;\theta)\{1-p_\pi^{\tilde\vartheta}(x;\theta)\}$,
\[
p_\pi^{\tilde\vartheta}(x;\theta) = \frac{e^{g(x;\theta)+\tilde l}}{1+e^{g(x;\theta)+\tilde l}}, \quad \text{with } \tilde l = -\log\{\pi_\varrho^{os}(x;\tilde\vartheta)\}, \tag{S.36}
\]
\[
\dot\ell_{lik}^{\tilde\vartheta}(\theta) = \sum_{i=1}^N \delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta)\}\dot g(x_i;\theta),
\]
\[
\Delta_{\ell_{lik}}^{\tilde\vartheta} = \frac{1}{2a_N^2}\sum_{i=1}^N \delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}u^{\mathrm{T}}\ddot g(x_i;\theta^*)u, \tag{S.37}
\]
and $R_{lik}^{\tilde\vartheta}$ is the remainder term. By direct calculations similar to those used in the proof of Theorem 4, we know that
\[
|R_{lik}^{\tilde\vartheta}| \le \frac{\|u\|^3}{6a_N^3}\sum_{i=1}^N \delta_i^{\tilde\vartheta}C(x_i,\theta^*+a_N^{-1}\acute{u}) + \frac{d\|u\|^3}{6a_N^3}\sum_{i=1}^N \delta_i^{\tilde\vartheta}y_iB(x_i)
\le \frac{(d+1)\|u\|^3}{6a_N^3}\sum_{i=1}^N \delta_i^{\tilde\vartheta}B(x_i),
\]
f (x;β∗ ) −α∗
≤ a−1 B(x) + a−1 −1
ρ)(eα̃ ω̃)−1 E ef (x;θ̃) kġ(x; θ̃)kB(x) ϑ̃
N E e N λ̃min (e
f (x;β∗ ) ∗
≤ a−1 −1 −1
λ̃min (e−α ρ)(eα̃ ω̃)−1 E{B 2 (x)} = oP (1),
N E e B(x) + aN
ϑ̃
where λ̃min is the minimum eigenvalue of M̃f . Thus, Markov’s inequality shows that Rlik = oP (1).
For $\Delta_{\ell_{lik}}^{\tilde\vartheta}$, the conditional mean $E(\Delta_{\ell_{lik}}^{\tilde\vartheta} \mid \tilde\vartheta) = 0$ and the conditional variance satisfies
\[
V(\Delta_{\ell_{lik}}^{\tilde\vartheta} \mid \tilde\vartheta) \le \frac{\|u\|^4}{4a_N^4}\sum_{i=1}^N E\big\{\delta_i^{\tilde\vartheta}p_\pi^{\tilde\vartheta}(x_i;\theta^*)\|\ddot g(x_i;\theta^*)\|^2 \,\big|\, \tilde\vartheta\big\}
\le \frac{\|u\|^4}{4a_N^2}E\big[e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\big] \to 0,
\]
so $\Delta_{\ell_{lik}}^{\tilde\vartheta} = o_P(1)$.
If we can show that
\[
a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \longrightarrow N(0, \Lambda_{lik}^{plt}) \tag{S.38}
\]
in distribution, and
\[
H_{lik}^{\tilde\vartheta} := a_N^{-2}\sum_{i=1}^N \delta_i^{\tilde\vartheta}\phi_\pi^{\tilde\vartheta}(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \longrightarrow \Lambda_{lik}^{plt} \tag{S.39}
\]
in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that $a_N(\hat\theta_{lik}^{\tilde\vartheta} - \theta^*)$, the maximizer of $\gamma_{lik}^{\tilde\vartheta}(u)$, satisfies
\[
a_N(\hat\theta_{lik}^{\tilde\vartheta} - \theta^*) = (\Lambda_{lik}^{plt})^{-1} \times a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) + o_P(1). \tag{S.40}
\]
Slutsky's theorem together with (S.38) and (S.40) implies the result of the theorem. We prove (S.38) and (S.39) in the following.
Given $\tilde\vartheta$, $\dot\ell_{lik}^{\tilde\vartheta}(\theta^*)$ is a summation of i.i.d. quantities $\delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}\dot g(x_i;\theta^*)$. Using an approach similar to that used to obtain (S.18), we know that $E\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\} = 0$. The conditional variance of $a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*)$ satisfies
\[
V\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\}
= e^{-\alpha^*}E\big[\{y+(1-y)\pi_\varrho^{os}(x;\tilde\vartheta)\}\{y - p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big]
\]
\[
= e^{-\alpha^*}E\big[\big[p(x;\theta^*)\{1-p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2 + \pi_\varrho^{os}(x;\tilde\vartheta)\{1-p(x;\theta^*)\}\{p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2\big]\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big]
\]
\[
= e^{-\alpha^*}E\big[\big[\{1-p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2 + \phi_\pi^{\tilde\vartheta}(x;\theta^*)\big]p(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big]
\]
\[
= E\big[\big[\{1-p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2 + \phi_\pi^{\tilde\vartheta}(x;\theta^*)\big]\{1-p(x;\theta^*)\}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big].
\]
The elements of the term on the right-hand side of the above equation are bounded by the integrable random variable $e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2$, so if $\varrho = o(\rho)$ then
\[
V\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*)\} \longrightarrow \Lambda_{lik}^{plt}. \tag{S.41}
\]
Using a similar approach as in checking the Lindeberg-Feller condition in the proof of Theorem 4, we obtain that for any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\big[\|\delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}\dot g(x_i;\theta^*)\|^2 I(\|\delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}\dot g(x_i;\theta^*)\| > \varepsilon a_N) \,\big|\, \tilde\vartheta\big]
\]
\[
= N E\big[\delta^{\tilde\vartheta}\|\{y - p_\pi^{\tilde\vartheta}(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\delta^{\tilde\vartheta}\|\{y - p_\pi^{\tilde\vartheta}(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N) \,\big|\, \tilde\vartheta\big] = o(a_N^2),
\]
where the last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998), we finish the proof of (S.38).
Now we prove (S.39). By derivations similar to those in (S.27), we know that
\[
E(H_{lik}^{\tilde\vartheta} \mid \tilde\vartheta) = V\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\} = \Lambda_{lik}^{plt} + o_P(1), \tag{S.42}
\]
where the second equality is from (S.41). In addition, the conditional variance of each component of $H_{lik}^{\tilde\vartheta}$ is bounded by
\[
a_N^{-4}N\,E\big\{\delta^{\tilde\vartheta}\{\phi_\pi^{\tilde\vartheta}(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^4 \,\big|\, \tilde\vartheta\big\}
\le a_N^{-4}N\,E\big[\{y+(1-y)\pi_\varrho^{os}(x;\tilde\vartheta)\}p_\pi^{\tilde\vartheta}(x;\theta^*)\|\dot g(x;\theta^*)\|^4 \,\big|\, \tilde\vartheta\big]
\]
\[
\le a_N^{-4}N\,E\big[\{p(x;\theta^*)+\pi_\varrho^{os}(x;\tilde\vartheta)\}p_\pi^{\tilde\vartheta}(x;\theta^*)\|\dot g(x;\theta^*)\|^4 \,\big|\, \tilde\vartheta\big]
\le 2a_N^{-4}N e^{\alpha^*}E\{e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\} = 2a_N^{-2}E\{e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\} = o(1). \tag{S.43}
\]
From (S.42) and (S.43), Chebyshev's inequality implies that $H_{lik}^{\tilde\vartheta} \to \Lambda_{lik}^{plt}$ in probability.
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.4: Degree of perturbation ξ = 0.1. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.5: Degree of perturbation ξ = 0.5. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.6: Degree of perturbation ξ = 1.0. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.7: Degree of perturbation ξ = 0.1. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.8: Degree of perturbation ξ = 0.5. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.9: Degree of perturbation ξ = 1.0. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
This section studies the case where the model is mis-specified, that is, the ground-truth model falls outside the family of linear logistic regression models. We design the ground-truth model as a two-layer neural network with a non-linear activation:
\[
g(x;\theta) = \alpha + \frac{1}{\xi}\tanh(\xi \cdot x^{\mathrm{T}}W)\beta, \tag{S.45}
\]
where $\xi$ is a constant controlling the "cut-off" threshold of the non-linear activation $\tanh$. Note that as $\xi \to 0$, the scaled tanh function $\frac{1}{\xi}\tanh(\xi \cdot x) \approx x$, as shown in Figure S.10. Thus, by tuning $\xi$ we control the deviation of the scaled tanh function from the linear identity transformation.
Figure S.10: The scaled tanh function $\frac{1}{\xi}\tanh(\xi x)$ approximates the identity transformation as $\xi \to 0$ (shown for $\xi = 1, 0.5, 0.3, 0.1$).
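To see numerically how fast $\frac{1}{\xi}\tanh(\xi x)$ approaches the identity, and that the deviation grows with $\xi$, a small sketch:

```python
import numpy as np

def scaled_tanh(x, xi):
    """(1/xi) * tanh(xi * x), the activation in the ground-truth model (S.45)."""
    return np.tanh(xi * x) / xi

x = np.linspace(-10.0, 10.0, 201)
for xi in (1.0, 0.5, 0.3, 0.1):
    dev = np.max(np.abs(scaled_tanh(x, xi) - x))
    print(f"xi={xi}: max deviation from identity = {dev:.3f}")
# the deviation from the identity shrinks monotonically as xi decreases
```

For very small $\xi$ (e.g. $10^{-3}$), the deviation over this range is negligible, which is why the results in Figure S.11 match the well-specified case.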
When $\xi$ and $\xi_w$ are extremely small, $g(x;\theta) \approx \alpha + x^{\mathrm{T}}\beta$. The empirical results shown in Figure S.11 match the results without model mis-specification.
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.11: Degree of non-linearity ξ = 10−3 , ξw = 10−3 . Ground-truth intercepts: (a) α = −7.65,
(b) α = −0.5, (c) α = −7, (d) α = −1.8. Log (MSEs) of subsample estimators for different sample
sizes (the smaller the better).
We fix $\xi_w = 10^{-3}$ and tune $\xi$ to study the effect of the non-linear activation. Note that when $\xi \to 0$, $\frac{1}{\xi}\tanh(\xi x) \approx x$, and when $\xi \to \infty$, $\frac{1}{\xi}\tanh(\xi x) \approx 0$. As we shall see, the value of $\xi$ significantly affects the non-linearity of the model. In Figure S.12, we choose a relatively small $\xi = 0.1$ and observe an almost indistinguishable loss in log(MSE). In Figure S.13, we study the case $\xi = 0.5$ and find that it significantly affects the model estimation when $x$ has long tails (Figure S.13(b), (d)). When $\xi = 1$, all methods break down, since the true (non-linear) model deviates from the tractable (linear) family. To avoid this, we suggest re-scaling and centering the data before feeding them to the model. Note that when the models do not break down, optLik still achieves the best performance among all methods.
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.12: Degree of non-linearity ξ = 10−1 , ξw = 10−3 . Ground-truth intercepts: (a) α = −7.7,
(b) α = −0.6, (c) α = −6.9, (d) α = −1.9. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.13: Degree of non-linearity ξ = 5·10−1 , ξw = 10−3 . Ground-truth intercepts: (a) α = −7.4,
(b) α = −0.95, (c) α = −6.7, (d) α = −2.15. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.14: Degree of non-linearity ξ = 1.0, ξw = 10−3 . Ground-truth intercepts: (a) α = −7.05,
(b) α = −1.75, (c) α = −6.4, (d) α = −2.85. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
We fix $\xi = 10^{-3}$ and focus on studying the effect of the generative process of $W$ on the estimation error. In Figure S.15, we set $\xi_w = 2\cdot10^{-2}$, a relatively small value such that $\|W - I\|_{2\to2} \approx 0.20 < 1$. In this case, each sampling method suffers from model mis-specification and has higher log(MSE). Still, optLik is the best under all generative processes of $x$. In Figure S.16, we set $\xi_w = 10^{-1}$, so that $\|W - I\|_{2\to2} \approx 1$. In this case, the noise level is comparable to the ground truth, and all sampling methods break down: the estimated models deviate significantly from the ground truth.
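The text above does not spell out how $W$ is generated. One construction consistent with the reported spectral norms ($\|W - I\|_{2\to2} \approx 0.2$ at $\xi_w = 2\cdot10^{-2}$ and $\approx 1$ at $\xi_w = 10^{-1}$) is $W = I + \xi_w Z$ with i.i.d. standard normal entries in $Z$, for which $\|\xi_w Z\|_2 \approx 2\xi_w\sqrt{d}$. Both this construction and the dimension below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_weight(d, xi_w):
    """Hypothetical construction: W = I + xi_w * Z with Z having i.i.d.
    N(0, 1) entries; the spectral norm ||W - I||_{2->2} is then roughly
    2 * xi_w * sqrt(d). Not the paper's documented generative process.
    """
    return np.eye(d) + xi_w * rng.standard_normal((d, d))

d = 25  # assumed dimension, chosen so the norms match the text
for xi_w in (2e-2, 1e-1):
    W = perturbed_weight(d, xi_w)
    # spectral (2->2 operator) norm of the deviation from the identity
    print(xi_w, np.linalg.norm(W - np.eye(d), ord=2))
```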
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
−3 −2
Figure S.15: Degree of non-linearity ξ = 10 , ξw = 2·10 . Ground-truth intercepts: (a) α = −7.7,
(b) α = −0.5, (c) α = −6.9, (d) α = −1.9. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
−3 −1
Figure S.16: Degree of non-linearity ξ = 10 , ξw = 10 . Ground-truth intercepts: (a) α = −8.25,
(b) α = −0.25, (c) α = −6.6, (d) α = −2.4. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
References
Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data Mining and
Knowledge Discovery Handbook, 875–886. Springer.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from
imbalanced data sets. ACM SIGKDD Explorations Newsletter 6, 1, 1–6.
Chen, J., Sun, B., Li, H., Lu, H., and Hua, X.-S. (2016). Deep CTR prediction in display advertising.
In Proceedings of the 24th ACM International Conference on Multimedia (MM), 811–820.
Cheng, G., Huang, J. Z., et al. (2010). Bootstrap consistency for general semiparametric M-estimation.
The Annals of Statistics 38, 5, 2884–2915.
Douzas, G. and Bacao, F. (2017). Self-organizing map oversampling (SOMO) for imbalanced data set
learning. Expert Systems with Applications 82, 40–52.
Drummond, C., Holte, R. C., et al. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling
beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, vol. 11, 1–8. Citeseer.
Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for learning from
imbalanced data sets. Computational Intelligence 20, 1, 18–36.
Fithian, W. and Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced
data sets. Annals of Statistics 42, 5, 1693.
Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). DeepFM: a factorization-machine based neural
network for CTR prediction. arXiv preprint arXiv:1703.04247.
Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-smote: A new over-sampling method in
imbalanced data sets learning. In D.-S. Huang, X.-P. Zhang, and G.-B. Huang, eds., Advances in
Intelligent Computing, 878–887, Berlin, Heidelberg. Springer Berlin Heidelberg.
Han, L., Tan, K. M., Yang, T., Zhang, T., et al. (2020). Local uncertainty sampling for large-scale
multiclass logistic regression. Annals of Statistics 48, 3, 1770–1788.
He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al.
(2014). Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth
International Workshop on Data Mining for Online Advertising, 1–9.
Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions.
Technometrics 37, 2, 185–194.
Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers of convex processes. arXiv preprint
arXiv:1107.3806 .
Huang, J.-T., Sharma, A., Sun, S., Xia, L., Zhang, D., Pronin, P., Padmanabhan, J., Ottaviano, G., and
Yang, L. (2020). Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining, 2553–2561.
Japkowicz, N. (2000). Learning from imbalanced data sets: Papers from the AAAI workshop, AAAI,
2000. Technical Report WS-00-05.
Kilbertus, N., Parascandolo, G., and Schölkopf, B. (2018). Generalization in anti-causal learning.
arXiv preprint arXiv:1812.00524 .
King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political Analysis 9, 2, 137–163.
Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle
the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research
18, 1, 559–563.
Liu, X., Wu, J., and Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning. IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2, 539–550.
Mathew, J., Pang, C. K., Luo, M., and Leong, W. H. (2017). Classification of imbalanced data by
oversampling in kernel space of support vector machines. IEEE Transactions on Neural Networks
and Learning Systems 29, 9, 4065–4076.
McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov,
E., Golovin, D., et al. (2013). Ad click prediction: a view from the trenches. In Proceedings
of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
1222–1230.
Owen, A. and Zhou, Y. (2000). Safe and effective importance sampling. Journal of the American
Statistical Association 95, 449, 135–143.
Owen, A. B. (2007). Infinitely imbalanced logistic regression. The Journal of Machine Learning
Research 8, 761–773.
Pukelsheim, F. (2006). Optimal design of experiments. SIAM.
Rahman, M. M. and Davis, D. (2013). Addressing the class imbalance problem in medical datasets.
International Journal of Machine Learning and Computing 3, 2, 224.
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and
anticausal learning. In ICML 2012: Proceedings of the 29th International Conference on Machine
Learning, Edinburgh, Scotland, June 26-July 1, 2012, 1255–1262.
Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007). Cost-sensitive boosting for classification
of imbalanced data. Pattern Recognition 40, 12, 3358–3378.
Ting, D. and Brochu, E. (2018). Optimal subsampling with influence functions. In Advances in
neural information processing systems, 3650–3659.
van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, London.
Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal
of Machine Learning Research 20, 132, 1–59.
Wang, H. (2020). Logistic regression for massive data with rare events. In H. D. III and A. Singh, eds.,
Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings
of Machine Learning Research, 9829–9836. PMLR.
Wang, H. and Ma, Y. (2021). Optimal subsampling for quantile regression in big data. Biometrika
108, 1, 99–112.
Wang, H., Zhu, R., and Ma, P. (2018). Optimal subsampling for large sample logistic regression.
Journal of the American Statistical Association 113, 522, 829–844.
Xiong, S. and Li, G. (2008). Some results on the convergence of conditional distributions. Statistics
& Probability Letters 78, 18, 3249–3253.
Yu, J., Wang, H., Ai, M., and Zhang, H. (2020). Optimal distributed subsampling for maximum
quasi-likelihood estimators with massive data. Journal of the American Statistical Association 0, 0,
1–12.