Nonuniform Negative Sampling and Log Odds Correction With Rare Events Data
Chong Wang
ByteDance Inc.
[email protected]
Abstract
We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about unknown parameters is tied only to the relatively small number of positive instances, which justifies the use of negative sampling. However, if the negative instances are subsampled to the same level as the positive cases, there is information loss. To maintain more information, we derive the asymptotic distribution of a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance. To further improve the estimation efficiency over the IPW method, we propose a likelihood-based estimator by correcting the log odds for the sampled data and prove that the improved estimator has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification. We validate our approach on simulated data as well as a real click-through rate dataset with more than 0.3 trillion instances, collected over a period of a month. Both theoretical and empirical results demonstrate the effectiveness of our method.
1 Introduction
Imbalanced data are ubiquitous in scientific fields and applications with binary response outputs, where the number of instances in the positive class is much smaller than that in the negative class. For example, in online search or recommendation systems where billions of impressions appear each day, non-click impressions usually dominate. Using these non-click impressions as negative instances is prevalent and has proven useful, especially in modern machine learning systems (McMahan et al., 2013; He et al., 2014; Chen et al., 2016; Guo et al., 2017; Huang et al., 2020). A common approach to address imbalanced data is to balance it by subsampling the negative class (Drummond et al., 2003; Liu et al., 2009) and/or up-sampling the positive class (Chawla et al., 2002; Han et al., 2005; Mathew et al., 2017; Douzas and Bacao, 2017), and a great deal of attention has been drawn to this problem; see Japkowicz (2000); King and Zeng (2001); Chawla et al. (2004); Estabrooks et al. (2004); Owen (2007); Sun et al. (2007); Chawla (2009); Rahman and Davis (2013); Fithian and Hastie (2014); Lemaître et al. (2017); Wang (2020), and the references therein.
In this paper, we focus on subsampling the negative class, i.e., negative sampling, while keeping all rare positive instances. This is beneficial for modern online learning systems, where negative sampling significantly reduces data volume in distributed storage while saving training time for multiple downstream models. Rare events data and negative sampling are studied in Wang (2020). Our main contributions are as follows.
• Under a general binary response model with rare events data, we find that the difference between the full data estimator and the true parameter converges to a normal distribution at a rate that is tied only to the number of rare positive instances. This indicates the possibility of throwing away the majority of negative instances without information loss, i.e., it justifies the use of negative sampling.
• We show that there is no asymptotic information loss under aggressive negative sampling if the negative instances kept still dominate the positive instances. However, there is information loss, expressed as a variance inflation, if the negative instances are brought down to the same level as the positive instances. For this case, we obtain optimal subsampling probabilities by minimizing the unconditional asymptotic variance of a general IPW estimator.
• We develop a likelihood estimator through nonuniform log odds correction for sampled data,
and prove that it has the smallest asymptotic variance among a large class of estimators.
• We apply the proposed method to a real online streaming click-through rate (CTR) dataset
with more than 0.3 trillion instances and demonstrate its effectiveness.
The rest of the paper is organized as follows. Section 2 defines the problem this paper focuses on and presents the full data results. Section 3 derives the optimal probability for the IPW estimator. Section 4 proposes the likelihood-based estimator and establishes its asymptotic optimality. Section 5 discusses practical implementation and the theoretical properties of the resulting estimators. Section 6 presents numerical experiments with simulated data and applies the proposed method to a real online streaming CTR dataset with more than 0.3 trillion instances. Section 7 concludes the paper and points out limitations of our investigation. Proofs and additional experiments with misspecifications are in the supplementary materials.
small. Heuristically, this implies that a change in the feature value does not turn a rare event into a large-probability event. We can also allow β∗ to change with N, but as long as β∗ has a finite limit, the situation is essentially the same as a fixed β∗, so we assume β∗ to be fixed to simplify the presentation.
Throughout this paper, we use ġ(x; θ) and g̈(x; θ) to denote the gradient and Hessian of g(x; θ) with respect to (w.r.t.) θ, respectively. Let ‖v‖ be the Frobenius norm and v^⊗2 = vv^T for a vector or a matrix v. We denote by π(x, y) the sampling probability for an instance (x, y); for negative cases, we sometimes use the shorthand π(x). We use o_P(1) to represent a random quantity that converges to 0 in probability as N → ∞.
We show that for rare events data, the estimation error rate for θ is related to N₁ instead of N. Using the full data, a commonly used estimator of θ is the maximum likelihood estimator (MLE)
θ̂_f := arg max_θ Σ_{i=1}^N [y_i g(x_i; θ) − log{1 + e^{g(x_i; θ)}}].
To investigate the theoretical properties of the estimator θ̂f , we make the following assumptions.
Assumption 1. The first, second, and third derivatives of f(x; β) and e^{f(x;β)} f(x; β) w.r.t. any components of β are bounded by a square integrable random variable B(x).
Assumption 2. The matrix E{ġ^⊗2(x; θ∗)} is finite and positive definite.
Assumption 1 imposes constraints on the smoothness of f(x; β) w.r.t. β and on the tightness of the distribution of x. Assumption 2 ensures that all components of θ are estimable.
Theorem 1. Let V_f = E{e^{f(x;β∗)}} M_f^{-1}, where M_f = E{e^{f(x;β∗)} ġ^⊗2(x; θ∗)}. Under Assumptions 1 and 2, as N → ∞,
√N₁ (θ̂_f − θ∗) −→ N(0, V_f), in distribution.
This result shows that even if all the N instances are used for model training, the resulting estimator θ̂_f converges to the true parameter θ∗ at the rate of N₁^{-1/2}, which is much slower than the usual rate of N^{-1/2} for regular cases. This indicates that with rare events data, the available information about unknown parameters is at the scale of N₁. Although N → ∞ much faster than N₁ → ∞ (in the sense that N₁/N → 0), N does not explicitly show up in the asymptotic normality result of Theorem 1. For the specific case of linear logistic regression with g(x; θ) = α + x^T β, our Theorem 1 reduces to Theorem 1 of Wang (2020).
Since the available information is tied to the number of positive instances instead of the full data size, one can keep all the positive instances and significantly subsample the negative instances to reduce the computational cost. For better estimation efficiency, we consider nonuniform sampling. Let ϕ(x) > 0 be an integrable function and ρ be the sampling rate on the negative class. Without loss of generality, assume E{ϕ(x)} = 1 so that π(x_i) := ρϕ(x_i) is the sampling probability for the i-th data point if y_i = 0. We present the nonuniform negative sampling procedure in Algorithm 1.
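As a concrete illustration, this sampling scheme (keep every positive instance; keep the i-th negative with probability π(x_i) = ρϕ(x_i)) can be sketched in a few lines of NumPy. The function and variable names below are our own, not from the paper:

```python
import numpy as np

def negative_sample(X, y, rho, phi, seed=0):
    """Keep all positives; keep negative i with probability rho * phi(x_i).

    Returns the subsample together with each kept instance's inclusion
    probability pi_i, which downstream estimators need (IPW weights or
    log odds offsets).
    """
    rng = np.random.default_rng(seed)
    pi = np.where(y == 1, 1.0, np.clip(rho * phi(X), 0.0, 1.0))
    keep = rng.random(len(y)) < pi
    return X[keep], y[keep], pi[keep]

# Toy imbalanced data: roughly 1% positives, uniform phi(x) = 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 3))
y = (rng.random(100_000) < 0.01).astype(int)
Xs, ys, pis = negative_sample(X, y, rho=0.02, phi=lambda X: np.ones(len(X)))
```

Every positive instance survives the subsampling, while the negative class shrinks by roughly a factor of ρ.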
3 Weighted estimation and its optimal negative sampling probability
Let δ_i = 1 if the i-th data point is selected and δ_i = 0 otherwise. Given a subsample taken according to π(x_i, y_i) = y_i + (1 − y_i)ρϕ(x_i), i = 1, ..., N, the IPW estimator of θ is
θ̂_w = arg max_θ Σ_{i=1}^N δ_i π^{-1}(x_i, y_i) [y_i g(x_i; θ) − log{1 + e^{g(x_i; θ)}}].   (2)
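For a linear g(x; θ) = α + x^T β, the IPW objective in (2) can be maximized with a generic optimizer. The sketch below is our own illustration using SciPy's BFGS, not code from the paper; np.logaddexp(0, g) computes log(1 + e^g) stably:

```python
import numpy as np
from scipy.optimize import minimize

def ipw_logistic(Xs, ys, pis):
    """IPW estimator (2): maximize sum_i pi_i^{-1} [y_i g_i - log(1 + e^{g_i})]."""
    Z = np.hstack([np.ones((len(ys), 1)), Xs])  # intercept column; theta = (alpha, beta)
    w = 1.0 / pis

    def neg_loglik(theta):
        g = Z @ theta
        return -np.sum(w * (ys * g - np.logaddexp(0.0, g)))

    def grad(theta):                            # -Z^T [w * (y - sigmoid(g))]
        g = Z @ theta
        return -Z.T @ (w * (ys - 1.0 / (1.0 + np.exp(-g))))

    return minimize(neg_loglik, np.zeros(Z.shape[1]), jac=grad, method="BFGS").x

# Simulate rare-events logistic data, subsample negatives uniformly, estimate.
rng = np.random.default_rng(0)
theta_true = np.array([-3.0, 1.0, -1.0])
X = rng.normal(size=(50_000, 2))
p = 1.0 / (1.0 + np.exp(-(theta_true[0] + X @ theta_true[1:])))
y = (rng.random(50_000) < p).astype(int)
pi = np.where(y == 1, 1.0, 0.2)                 # rho = 0.2, phi = 1
keep = rng.random(50_000) < pi
theta_hat = ipw_logistic(X[keep], y[keep], pi[keep])
```

The weights 1/π_i undo the sampling bias, so θ̂_w is consistent for the full-data parameter.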
We need an assumption on the subsampling rate to investigate the subsample estimators.
Assumption 3. The subsampling rate ρ satisfies c_N := e^{α∗}/ρ → c for a constant c such that 0 ≤ c < ∞.
Note that c_N cannot be zero but its limit c can be exactly zero. It can be shown that c_N E{e^{f(x;β∗)}} = N₁(N₀ρ)^{-1}{1 + o_P(1)}, so cE{e^{f(x;β∗)}} is the asymptotic positive/negative ratio for the sample.
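To see why cE{e^{f(x;β∗)}} is the asymptotic positive/negative ratio of the subsample, a heuristic sketch consistent with the definitions above: since Pr(y = 1 | x) ≈ e^{α∗+f(x;β∗)} when α∗ → −∞, N₀ ≈ N, and the expected number of sampled negatives is N₀ρ (using E{ϕ(x)} = 1),

```latex
\frac{N_1}{N_0 \rho}
  \approx \frac{N\,\mathrm{E}\{e^{\alpha^* + f(x;\beta^*)}\}}{N\rho}
  = \frac{e^{\alpha^*}}{\rho}\,\mathrm{E}\{e^{f(x;\beta^*)}\}
  = c_N\,\mathrm{E}\{e^{f(x;\beta^*)}\}
  \to c\,\mathrm{E}\{e^{f(x;\beta^*)}\}.
```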
The theorem below shows the asymptotic distribution of θ̂w .
Theorem 2. Under Assumptions 1-3, if E[{ϕ(x) + ϕ^{-1}(x)}B²(x)] < ∞, then as N → ∞,
√N₁ (θ̂_w − θ∗) −→ N(0, V_w), in distribution,
where V_w = V_f + V_sub and V_sub = cE{e^{f(x;β∗)}} M_f^{-1} E{ϕ^{-1}(x) e^{2f(x;β∗)} ġ^⊗2(x; θ∗)} M_f^{-1}.
The subsampled estimator θ̂_w has the same convergence rate as the full data estimator. However, the asymptotic variance V_w may be inflated by the term V_sub from subsampling the negative cases. Here V_sub is zero if c = 0. If we keep many more negative instances than positive instances in the subsample (c = 0), then there is no asymptotic information loss due to subsampling (V_w = V_f). In this scenario, the number of negative instances can still be aggressively reduced (ρ → 0) so that the computational efficiency is significantly improved, and the sampling function ϕ(x) does not play a significant role. If we have to reduce the negative instances to the same level as the positive instances (0 < c < ∞), then the variance inflation V_sub is not negligible. In this scenario, a well designed sampling function ϕ(x) is more relevant and critical.
Theorem 2 shows that the distribution of the estimation error err := √N₁ (θ̂_w − θ∗) is approximated by a normal distribution N(0, V_w). Thus for any ε > 0, the probability of excess error Pr(‖err‖ > ε) is approximated by Pr{‖N(0, V_w)‖ > ε} = Pr(Σ_{j=1}^d λ_j z_j² > ε²), where the λ_j's are eigenvalues of V_w and the z_j's are independent standard normal random variables. Therefore, a smaller trace of V_w means a smaller probability of excess error at the same level, since tr(V_w) = Σ_{j=1}^d λ_j.
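The eigenvalue identity used here, ‖N(0, V)‖² = Σ_j λ_j z_j², is easy to check numerically. The following self-contained snippet (our own illustration, with an arbitrary 2×2 covariance standing in for V_w) compares the two ways of computing the excess-error probability by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # stand-in for V_w
lam = np.linalg.eigvalsh(V)           # eigenvalues lambda_j
eps = 2.0

# Pr(||N(0, V)|| > eps) by direct simulation.
draws = rng.multivariate_normal(np.zeros(2), V, size=200_000)
p_direct = np.mean(np.linalg.norm(draws, axis=1) > eps)

# Same probability via Pr(sum_j lambda_j z_j^2 > eps^2) with z_j iid N(0, 1).
z = rng.standard_normal((200_000, 2))
p_eig = np.mean(z**2 @ lam > eps**2)
```

The two Monte Carlo estimates agree up to simulation noise, and the eigenvalues sum to tr(V), which is why minimizing the trace shrinks the excess-error probability.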
The following theorem gives the optimal negative sampling function for the IPW estimator.
Theorem 3. For a given sampling rate ρ, the asymptotically optimal ϕ(x) that minimizes tr(V_w) is
ϕ_os(x) = min{t(x; θ∗), T} / E[min{t(x; θ∗), T}],   (3)
where t(x; θ) = p(x; θ)‖M_f^{-1} ġ(x; θ)‖ and T is the largest number such that ρ min{t(x; θ∗), T} ≤ E[min{t(x; θ∗), T}] with probability one.
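An empirical version of (3) on a finite sample is straightforward: score each instance by t_i, then normalize, bisecting on the truncation point T only when ρ max_i t_i exceeds the mean score (per Remark 1 below, the untruncated branch is the common case for very imbalanced data). This sketch is our own implementation, not code from the paper:

```python
import numpy as np

def phi_os(t, rho, iters=200):
    """Empirical (3): phi_i = min(t_i, T) / mean(min(t, T)), with T the largest
    truncation level keeping every probability rho * phi_i at most 1."""
    t = np.asarray(t, dtype=float)
    if rho * t.max() <= t.mean():       # no truncation needed (T = infinity)
        return t / t.mean()
    lo, hi = 0.0, t.max()               # feasible / infeasible truncation levels
    for _ in range(iters):              # bisect on: rho * T <= mean(min(t, T))
        T = 0.5 * (lo + hi)
        if rho * T <= np.minimum(t, T).mean():
            lo = T
        else:
            hi = T
    m = np.minimum(t, lo)
    return m / m.mean()

t = np.abs(np.random.default_rng(1).standard_normal(10_000)) ** 3  # skewed scores
phi = phi_os(t, rho=0.5)
```

Bisection works here because the feasibility gap mean(min(t, T)) − ρT is concave in T and zero at T = 0, so the feasible truncation levels form an interval [0, T∗].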
Remark 1. If ρ → 0, then ρt(x; θ∗) ≤ E{t(x; θ∗)} almost surely, so T can be dropped (i.e., T = ∞). If lim_{N→∞} ρ > 0, then c = 0, so the variance inflation due to subsampling is negligible. Thus, the truncation term T can be ignored in practice with imbalanced data; it only plays a role for not-so-imbalanced data when the sampling ratio is not very small.
The term p(x; θ∗) in t(x; θ∗) comes from the binary response model structure, and it gives higher preference to a data point with larger p(x; θ∗) in the negative class. The term ‖M_f^{-1} ġ(x; θ∗)‖ corresponds to the A-optimality criterion (Pukelsheim, 2006). For computational benefit in optimal sampling, A-optimality is often replaced by L-optimality (e.g., Wang et al., 2018; Ting and Brochu, 2018). Under L-optimality, ‖M_f^{-1} ġ(x; θ∗)‖ is replaced by ‖ġ(x; θ∗)‖, and this minimizes the asymptotic mean squared error (MSE) of M_f θ̂_w. In case the gradient is difficult to obtain, one could drop the gradient term and simply replace t(x; θ∗) by p(x; θ∗). Although this does not give an optimal sampling probability, it outperforms uniform subsampling because it takes into account the information of the model structure, like local case-control (LCC) sampling.
4 More efficient estimation based on the likelihood with log odds correction
4.1 Nonuniform log odds correction
The optimal function ϕ_os(x) assigns larger probabilities to more informative instances. However, the IPW estimator in (2) assigns smaller weights to more informative instances in estimation, so the resulting estimator can be improved to achieve higher estimation efficiency. A naive unweighted estimator is biased, and the bias correction approach in Fithian and Hastie (2014); Wang (2019) does not work for the sampling procedure in Algorithm 1 even when the underlying model is logistic regression. We adopt the idea of Han et al. (2020) and seek a more efficient estimator through the corrected likelihood of the sampled data based on any negative sampling probability π(x). For data included in the subsample (where δ = 1), the conditional probability of y = 1 is
p_π(x; θ) := Pr(y = 1 | x, δ = 1) = 1 / (1 + e^{−g(x;θ)−l}),   (4)
where l = −log{π(x)}. Please see the detailed derivation of (4) in the supplement. To avoid the IPW, we propose the estimator based on the log odds corrected likelihood in (4), namely,
θ̂_lik = arg max_θ ℓ_lik(θ), where ℓ_lik(θ) = Σ_{i=1}^N δ_i [y_i g(x_i; θ) − log{1 + e^{g(x_i; θ)+l_i}}].   (5)
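Although the full derivation of (4) is in the supplement, it follows from Bayes' theorem in a few lines. A sketch, writing p(x; θ) = e^{g(x;θ)}/(1 + e^{g(x;θ)}) and using Pr(δ = 1 | y = 1, x) = 1 and Pr(δ = 1 | y = 0, x) = π(x):

```latex
\Pr(y=1 \mid x, \delta=1)
 = \frac{\Pr(\delta=1 \mid y=1, x)\,p(x;\theta)}
        {\Pr(\delta=1 \mid y=1, x)\,p(x;\theta) + \Pr(\delta=1 \mid y=0, x)\,\{1-p(x;\theta)\}}
 = \frac{p(x;\theta)}{p(x;\theta) + \pi(x)\{1 - p(x;\theta)\}}
 = \frac{e^{g(x;\theta)}}{e^{g(x;\theta)} + \pi(x)}
 = \frac{1}{1 + e^{-g(x;\theta) + \log \pi(x)}}
 = \frac{1}{1 + e^{-g(x;\theta) - l}},
```

with l = −log{π(x)}, matching (4).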
For θ̂_w in (2), π(x_i) enters through its inverse. If an instance with a very small π(x_i) is selected into the sample, it dominates the objective function, making the resulting estimator unstable. For θ̂_lik, this problem is ameliorated because π(x_i) enters through a logarithm in the log-likelihood of (5), and log(v) goes to infinity much more slowly than v^{-1} as v ↓ 0.
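Concretely, for linear g the estimator (5) is an ordinary logistic regression with a per-instance offset l_i = −log π(x_i) inside the log-partition term, and no inverse-probability weights. A minimal SciPy sketch (our own code, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize

def lik_logistic(Xs, ys, pis):
    """Log odds corrected estimator (5): offset l_i = -log pi_i, no IPW weights."""
    Z = np.hstack([np.ones((len(ys), 1)), Xs])
    l = -np.log(pis)                    # l = 0 for positives (pi = 1)

    def neg_loglik(theta):
        g = Z @ theta
        return -np.sum(ys * g - np.logaddexp(0.0, g + l))

    def grad(theta):                    # -Z^T [y - sigmoid(g + l)]
        g = Z @ theta
        return -Z.T @ (ys - 1.0 / (1.0 + np.exp(-(g + l))))

    return minimize(neg_loglik, np.zeros(Z.shape[1]), jac=grad, method="BFGS").x

# Simulate rare-events logistic data, subsample negatives uniformly, estimate.
rng = np.random.default_rng(0)
theta_true = np.array([-3.0, 1.0, -1.0])
X = rng.normal(size=(50_000, 2))
p = 1.0 / (1.0 + np.exp(-(theta_true[0] + X @ theta_true[1:])))
y = (rng.random(50_000) < p).astype(int)
pi = np.where(y == 1, 1.0, 0.2)
keep = rng.random(50_000) < pi
theta_hat = lik_logistic(X[keep], y[keep], pi[keep])
```

Because π(x_i) appears only inside the logarithm, extreme sampling probabilities perturb the objective far less than the 1/π(x_i) weights in (2).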
Theorem 4 shows that the proposed estimator θ̂_lik is asymptotically normal with variance V_lik. We compare V_lik with V_w to show that θ̂_lik has higher estimation efficiency than θ̂_w.
Theorem 5. If Vw and Vlik are finite and positive definite, then Vlik ≤ Vw , i.e., Vw − Vlik is
non-negative definite. The equality holds when c = 0 and in this case Vlik = Vw = Vf .
Since the estimator θ̂_lik in Eq. (5) is based on the conditional log-likelihood of the sampled data, it in fact has the highest estimation efficiency among a class of asymptotically unbiased estimators.
Theorem 6. Let X = (x₁, ..., x_N)^T be the feature matrix and D_δ denote the sampled data. Let θ̂_U be a subsample estimator with the following asymptotic representation:
θ̂_U = U(θ∗; D_δ) + N₁^{-1/2} o_P(1),   (6)
where the variance V{U(θ∗; D_δ)} exists, and U(θ∗; D_δ) satisfies E{U(θ∗; D_δ) | X} = θ∗ and E{U̇(θ∗; D_δ) | X} = 0 with U̇(θ∗; D_δ) = ∂U(θ∗; D_δ)/∂θ∗^T. If N₁ V{U(θ∗; D_δ)} → V_U in probability, then V_U ≥ V_lik.
Remark 2. Theorem 6 tells us that θ̂lik is statistically the most efficient among a class of estimators,
which includes both θ̂w and θ̂lik as special cases. The condition E{U(θ ∗ ; Dδ ) | X} = θ ∗ ensures that
the estimator is asymptotically unbiased. The condition E{U̇(θ ∗ ; Dδ ) | X} = 0 can be intuitively
interpreted as the requirement that the derivative of the oP (1) term in (6) is also negligible. Clearly,
this is satisfied by all unbiased estimators for which the oP (1) term in (6) is 0.
5 Practical considerations
In this section, we first show how we design a more practical estimator based on the previous results
and then present the theoretical analysis behind these improvements.
Like the LCC sampling (Fithian and Hastie, 2014) and the A-optimal sampling (Wang, 2019), the
ϕos (x) in (3) depends on θ, and thus a pilot value of θ, say θ̃, is required in practice. Here, θ̃ can
be constructed from a pilot sample, say (x̃i , ỹi ), i = 1, ..., ñ, taken by uniform sampling from both
classes with equal expected sample sizes. With θ̃ obtained, calculate t̃i = p(x̃i ; θ̃)kM̃−1 f ġ(x̃i ; θ̃)k
for i = 1, ..., ñ, where M̃f is the Hessian matrix of the pilot sample objective function. In case
that the Hessian matrix and gradients are hard to record, the gradient term can be dropped, i.e.,
use t̃i = p(x̃; θ̃). As mentioned in Remark 1, T can be ignored for very imbalanced data. The
expectation in the denominator of (3) can be approximated by mimicking the method of moment
estimator, namely by
ω̃ = 2N1 (ñN )−1 ỹi =1 t̃i + 2N0 (ñN )−1 ỹi =0 t̃i .
P P
(7)
In practice, it is common to set a lower threshold, say %, on sampling probabilities, to ensure that all
instances have positive probabilities to be selected. This is more critical for the IPW estimator while
the likelihood estimator is not very sensitive to very small sampling probabilities.
Denote the pilot value as ϑ̃ = (θ̃, ω̃). The following probabilities can be practically implemented:
π_%^os(x; ϑ̃) = min[max{ρϕ̃_os(x), %}, 1], where ϕ̃_os(x) = ω̃^{-1} t(x; θ̃),   (8)
and we use this in our experiments in Section 6. The ω̃ in (7) is essentially a normalizing term so that the expected sampling rate is ρ. For some practical systems such as online models, ω̃ is often treated as a tuning parameter. We will illustrate this in our real data example in Section 6.2.
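Putting (7) and (8) together, the pilot-based pipeline is a handful of lines. The sketch below is our own illustration (names hypothetical), assuming a pilot sample drawn with equal expected sizes from each class and pilot scores t̃_i already computed:

```python
import numpy as np

def omega_tilde(t_pilot, y_pilot, N1, N0):
    """Moment approximation (7) of the denominator E[t(x)] in (3)."""
    n, N = len(y_pilot), N1 + N0
    return (2 * N1 / (n * N)) * t_pilot[y_pilot == 1].sum() \
         + (2 * N0 / (n * N)) * t_pilot[y_pilot == 0].sum()

def sampling_prob(t, omega, rho, floor):
    """Truncated probabilities (8): min[max{rho * t / omega, floor}, 1]."""
    return np.clip(rho * t / omega, floor, 1.0)

# Hypothetical pilot: 100 instances per class, full data N1 = 1e3, N0 = 4e5.
rng = np.random.default_rng(3)
y_pilot = np.repeat([1, 0], 100)
t_pilot = np.where(y_pilot == 1,
                   rng.uniform(0.5, 1.0, 200),
                   rng.uniform(0.0, 0.1, 200))
omega = omega_tilde(t_pilot, y_pilot, N1=1_000, N0=400_000)
pi_neg = sampling_prob(t_pilot[y_pilot == 0], omega, rho=0.01, floor=1e-6)
```

The 2N₁/(ñN) and 2N₀/(ñN) factors undo the equal-size class-balanced pilot draw, so ω̃ is (in expectation) the population mean of t(x).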
6 Numerical experiments
We present simulated results and report the performance on a CTR dataset with more than 0.3 trillion
instances. More experiments with model and pilot misspecifications are available in the supplement.
We implement the following negative sampling methods for logistic regression: 1) uniW: uniform sampling with the IPW estimator; 2) uniLik: uniform sampling with the likelihood estimator; 3) optW: optimal sampling with the IPW estimator; 4) optLik: optimal sampling with the likelihood estimator. We also implement the full data MLE (Full) and LCC for comparison.
In each repetition of the simulation, we generate full data of size N = 5 × 105 from the logistic
regression with g(x; θ) = α + xT β. We set the true β ∗ to be a 6 × 1 vector of ones and set different
values for α for different feature distributions so that the positive/negative ratio is close to 1:400. We
consider the following feature distributions: (a) Normal distribution which is symmetric with light
tails; the true intercept is α = −7.65. (b) Log-normal distribution which is asymmetric and positively
skewed; the true intercept is α = −0.5. (c) t3 distribution which is symmetric with heavier tails; the
true intercept is α = −7.
We set the sampling rate to ρ = 0.002, 0.004, 0.006, 0.01, and 0.02 for all sampling methods. In each repetition, uniform samples of average size 100 are selected from each class to calculate the pilot estimates, so the uncertainty due to the pilot estimates is taken into account. We also consider pilot misspecification by adding a uniform random number from U(0, 1.5) to the pilot so that the pilot is systematically different from the true parameter. We repeat the simulation R = 1000 times to calculate the MSE as R^{-1} Σ_{r=1}^R ‖θ̂^{(r)} − θ∗‖², where θ̂^{(r)} is the estimate in the r-th repetition.
Figure 1: Log(MSEs) for different estimators (the smaller the better); panels: (a) x's are normal, (b) x's are lognormal, (c) x's are t3. The top row uses consistent pilot estimates; the bottom row uses misspecified pilot estimates.
Results are presented in Figure 1. The MSEs of all estimators decrease as the sampling rate ρ increases. The optLik outperforms the other sampling methods in general, and its performance is close to the full data MLE when ρ = 0.02, especially if the pilot is consistent. The advantage of optLik over optW is more evident for smaller ρ. If the pilot is misspecified, optLik is quite robust, but optW and LCC significantly deteriorate. Since uniLik and uniW do not require a pilot, they are not affected by pilot misspecification, but they have much larger MSE than optLik and optW when the pilot is consistent, especially uniW. Different feature distributions also affect the performances of the different sampling methods.
To verify Theorem 1 numerically, we let the full data size N increase at a faster rate than the average number of positive instances N₁a = E(N₁), so that the probability Pr(y = 1) decreases towards zero. As N increases, we fix the value of β∗ and set decreasing values of α∗ to mimic our scaling regime. We consider two covariate distributions: 1) a multivariate normal distribution, for which the logistic regression model is correct; and 2) a multivariate log-normal distribution, for which the logistic regression model is misspecified. We repeat the simulation 100 times to calculate the empirical variance V̂_f of the full data MLE, and report the results in Table 1. It is seen that tr(V̂_f) decreases towards zero. According to Theorem 1, tr(V̂_f) should converge at a rate of 1/N₁, so N₁a tr(V̂_f) should be relatively stable as N increases. This is indeed the case, as seen in the third and sixth columns of Table 1. On the other hand, N tr(V̂_f) grows dramatically as N increases. These observations confirm the theoretical result in Theorem 1. Furthermore, Table 1 suggests that the convergence rate in Theorem 1 may also hold for some misspecified models.
Table 1: Empirical variances of the full data MLE for different combinations of full data size and average number of positive instances.

              |        Correct model           |       Misspecified model
(N, N1a)      | tr(V̂f)  N1a tr(V̂f)  N tr(V̂f)  | tr(V̂f)  N1a tr(V̂f)  N tr(V̂f)
(10^3, 32)    | 0.169      5.41      169.17    | 0.969     30.99       968.70
(10^4, 64)    | 0.097      6.20      969.29    | 0.322     20.59      3217.12
(10^5, 128)   | 0.045      5.76     4497.24    | 0.135     17.32     13527.60
(10^6, 256)   | 0.018      4.62    18048.40    | 0.046     11.74     45847.40
We conduct experiments on an internal CTR dataset from our products with more than 0.3 trillion instances. There are 299 input features, including user and ad profiles together with rich context features. We concatenate all features into a single high-dimensional vector x, whose length is greater than 5,000. We use a 3-layer fully connected deep neural network as a feature extractor that maps x into a 256-dimensional dense feature vector h(x). The downstream model is a linear logistic regression model taking h(x) as input to predict the binary output of whether the user clicks the ad or not. The positive/negative ratio is around 1:80. Due to limited storage, we first uniformly drop 80% of the negative instances and try the various negative sampling algorithms on the remaining 60 billion instances. On this pre-processed dataset, we then split out 1 percent of the data for testing and do negative sampling on the rest. The entire training procedure scans the subsampled dataset from beginning to end in one pass according to the timestamp. Testing is done all the way through training. We calculate the area under the ROC curve (AUC) for the testing instances. The entire model is trained on a distributed machine learning platform using 1,200 CPU cores and can finish within two days.
We adopt the sampling probability π_%^os(x; ϑ̃) defined in (8) and use the empirical sampling rate to calibrate the normalizing term ω̃. As mentioned before, we remove the gradient term for implementation tractability, so t(x; θ̃) = p(x; θ̃). As the results will show, this approximate score produces results consistent with the simulation experiments. We compare four methods: 1) uniW and 2) uniLik are the same as in the simulated experiments; and 3) optW and 4) optLik represent nonuniform sampling using π_%^os(x; ϑ̃) with the IPW estimator and with the likelihood estimator, respectively. We use "opt" to refer to both optW and optLik when focusing on the sampling.
We study two sets of negative sampling rates (w.r.t. the original 0.3 trillion instances): (i) [0.01, 0.05], where negative instances still outnumber positive instances; and (ii) [0.001, 0.008], where negative subsampling is so aggressive that positive instances dominate negative instances. We demonstrate the results in Figure 2. When the sampling rate is moderate (Figure 2(b)), opt is nearly optimal and significantly outperforms uniform subsampling, and there is still a small gap between the IPW estimator and the likelihood estimator. When the sampling rate is extremely small (Figure 2(a)), opt becomes sub-optimal and its gap over uniform sampling narrows; still, we see a clear difference between the IPW estimator and the likelihood estimator. This is due to the larger asymptotic variance of uniform sampling when the sampling rate goes to zero. Note that on this huge dataset, a small relative AUC gain (e.g., 0.0005) can usually bring a significant revenue gain in products.
In practice, we find it necessary to use cross-validation to find the optimal % in (8) given a fixed sample
rate. As demonstrated in Figure 3, for each sampling rate we tune % from a very small value (in this
(a). Extremely small sampling rates. (b) Moderate sampling rates.
Figure 2: Empirical testing AUC of subsample estimators for different sample sizes (the larger the
better).
case, 10^{-6}) all the way to its largest value. When % reaches its largest value, π_%^os(x; ϑ̃) reduces to uniform negative sampling. We observe that π_%^os(x; ϑ̃) with a moderate % achieves the best results across various sampling rates. Note that % is the lower truncation level for negative sampling probabilities. Thus, a larger % means that more "simple" negative instances (those with ρω̃^{-1} p(x_i; θ̃) smaller than %) have a better chance of being selected. This coincides with recent empirical results for negative subsampling in large search systems (Huang et al., 2020), where they mix "simple" (smaller p(x_i; θ̃)) and "hard" (larger p(x_i; θ̃)) negative instances. We also observe an empirical variance of around 0.0001 for each setup, demonstrating that the relative improvement is consistent.
[Figure 3: ten panels sharing the legend uniW, uniLik, optW, optLik; panel titles Sample Rate = 0.05, 0.04, 0.03, 0.02, 0.01 (top row) and 0.008, 0.006, 0.004, 0.002, 0.001 (bottom row); x-axis: truncation lower bound %; y-axis: AUC.]
Figure 3: AUC for IPW and likelihood estimators when tuning the truncation lower bound %.
future studies. 2) Oversampling the rare positive instances is another common practice to deal with imbalanced data; its application together with negative sampling is not considered. 3) The assumption of a fixed β∗ means that both the marginal and conditional probabilities for a positive instance to occur are small. If the proportion of positive instances for a given feature value in a dataset is large, this assumption may not be appropriate, and further investigation is needed to see if our results still apply. In the following, we present three scenarios that the scaling regime may or may not fit:
Scenario 1 (phone call while driving) Car crashes occur with small probabilities, and making phone
calls while driving significantly increases the probability of a car crash. However, the probability
of car crashes among people making phone calls while driving is still small, so for these types of
features, our scaling regime is appropriate to model the rare events.
Scenario 2 (anti-causal learning) Anti-causal learning (Schölkopf et al., 2012; Kilbertus et al., 2018) assumes that the label (y) causes the observations (x). Thus Pr(x | y) represents the causal mechanism and is fixed. One standard example is that diseases (y) cause symptoms (x). Our scaling regime fits the framework of anti-causal learning. To see this, using Bayes' theorem we write the log odds as
α + f(x; β) = log{Pr(y = 1)/Pr(y = 0)} + log{Pr(x | y = 1)/Pr(x | y = 0)}.
In anti-causal learning, the marginal distribution of y changes while the conditional distribution of x given y is fixed. Thus only the scale factor α changes, and f(x; β) is fixed.
Scenario 3 (local transmission of COVID-19) Our scaling regime has limitations and does not apply to all types of rare events data. For example, although the COVID-19 infection rate for the whole population is low, the infection rate for people whose family members in the same house have positive test results is high. This means a change in a family member's test result converts a small-probability event into a large-probability event, and our scaling regime would not be appropriate.
Proofs and Additional Numerical Experiments
Before presenting the proofs, we point out that Assumption 2 implies that E{v(x)ġ^⊗2(x; θ)} is positive definite for any function v(x) > 0 almost surely. This is because Assumption 2 means that Pr{l^T ġ(x; θ) ≠ 0} > 0 for any l ≠ 0, and therefore Pr{v^{1/2}(x) l^T ġ(x; θ) ≠ 0} > 0 for any l ≠ 0, implying that E{v(x)ġ^⊗2(x; θ)} is positive definite. We also point out that if a sequence is o_P(1) conditionally, then it is also o_P(1) unconditionally, and vice versa (Xiong and Li, 2008; Cheng et al., 2010).
To facilitate the presentation, define a_N = (N e^{α∗})^{1/2}. We notice that
N₁ = N E[e^{α∗+f(x;β∗)} / {1 + e^{α∗+f(x;β∗)}}] {1 + o_P(1)} = a_N² E{e^{f(x;β∗)}}{1 + o_P(1)},   (S.1)
from the dominated convergence theorem. Thus the normalizing term √N₁ for the asymptotic normality in the paper can be replaced by a_N E^{1/2}{e^{f(x;β∗)}}.
so a_N(θ̂ − θ∗) is the maximizer of
γ(u) = ℓ(θ∗ + a_N^{-1} u) − ℓ(θ∗).
By Taylor's expansion,
γ(u) = a_N^{-1} u^T ℓ̇(θ∗) − 0.5 a_N^{-2} Σ_{i=1}^N φ(x_i; θ∗){u^T ġ(x_i; θ∗)}² + Δ_ℓ + R,
where
ℓ̇(θ) = ∂ℓ(θ)/∂θ = Σ_{i=1}^N {y_i − p(x_i; θ)} ġ(x_i; θ),
Δ_ℓ = (2a_N²)^{-1} Σ_{i=1}^N {y_i − p(x_i; θ∗)} u^T g̈(x_i; θ∗) u,
and R is the remainder term. By direct calculation,
R = (6a_N³)^{-1} Σ_{i=1}^N φ(x_i; θ∗ + a_N^{-1}ú){1 − 2p(x_i; θ∗ + a_N^{-1}ú)}{u^T ġ(x_i; θ∗ + a_N^{-1}ú)}³
  + (6a_N³)^{-1} Σ_{i=1}^N φ(x_i; θ∗ + a_N^{-1}ú){u^T ġ(x_i; θ∗ + a_N^{-1}ú)}{u^T g̈(x_i; θ∗ + a_N^{-1}ú)u}
  + (6a_N³)^{-1} Σ_{i=1}^N {y_i − p(x_i; θ∗ + a_N^{-1}ú)} Σ_{j1,j2=1}^d u_{j1} u_{j2} u^T g⃛_{j1j2}(x_i; θ∗ + a_N^{-1}ú),
where g⃛_{j1j2}(x; θ) is the gradient of g̈_{j1j2}(x; θ) and ú lies between 0 and u. We see that
|R| ≤ (6a_N³)^{-1} ‖u‖³ Σ_{i=1}^N p(x_i; θ∗ + a_N^{-1}ú) C(x_i; θ∗ + a_N^{-1}ú) + (6a_N³)^{-1} d‖u‖³ Σ_{i=1}^N y_i B(x_i)
  ≤ (6N a_N)^{-1} ‖u‖³ e^{a_N^{-1}‖u‖} Σ_{i=1}^N exp{f(x_i; β∗ + a_N^{-1}ú_{(−1)})} C(x_i; θ∗ + a_N^{-1}ú) + (6a_N³)^{-1} d‖u‖³ Σ_{i=1}^N y_i B(x_i)
  ≤ (6N a_N)^{-1} d³‖u‖³ e^{a_N^{-1}‖u‖} Σ_{i=1}^N B(x_i) + (6a_N³)^{-1} d‖u‖³ Σ_{i=1}^N y_i B(x_i) ≡ Δ_R1 + Δ_R2,   (S.2)
where C(x; θ) = ‖ġ(x; θ)‖³ + ‖ġ(x; θ)‖‖g̈(x; θ)‖ + Σ_{j1,j2} ‖g⃛_{j1j2}(x; θ)‖ and ú_{(−1)} is ú with the first element removed. From Assumption 1, E(Δ_R1) → 0 and
E(Δ_R2) ≤ (6N a_N)^{-1} d‖u‖³ Σ_{i=1}^N E{e^{f(x_i;β∗)} B(x_i)} = (6a_N)^{-1} d‖u‖³ E{e^{f(x;β∗)} B(x)} → 0.
Since Δ_R1 and Δ_R2 are both positive, Markov's inequality shows that they are both o_P(1), so R = o_P(1). For Δ_ℓ, the mean is E(Δ_ℓ) = 0 and the variance satisfies
V(Δ_ℓ) ≤ (4a_N⁴)^{-1} ‖u‖⁴ Σ_{i=1}^N E{p(x_i; θ∗)‖g̈(x_i; θ∗)‖²} ≤ (4N a_N²)^{-1} ‖u‖⁴ Σ_{i=1}^N E[e^{f(x_i;β∗)}‖g̈(x_i; θ∗)‖²] = (4a_N²)^{-1} ‖u‖⁴ E[e^{f(x;β∗)}‖g̈(x; θ∗)‖²] → 0,
so Δ_ℓ = o_P(1).
If we can show that
a_N^{-1} ℓ̇(θ∗) −→ N(0, M_f),   (S.3)
in distribution, and
a_N^{-2} Σ_{i=1}^N φ(x_i; θ∗) ġ^⊗2(x_i; θ∗) −→ M_f,   (S.4)
in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that a_N(θ̂ − θ∗), the maximizer of γ(u), satisfies
a_N(θ̂ − θ∗) = M_f^{-1} × a_N^{-1} ℓ̇(θ∗) + o_P(1).   (S.5)
Slutsky's theorem together with (S.3) and (S.5) implies the result in Theorem 1. We prove (S.3) and (S.4) in the following.
Note that
ℓ̇(θ∗) = Σ_{i=1}^N {y_i − p(x_i; θ∗)} ġ(x_i; θ∗)
is a summation of i.i.d. quantities. Since the distribution of {y − p(x; θ∗)}ġ(x; θ∗) depends on N because α∗ → −∞ as N → ∞, we need to use the Lindeberg-Feller central limit theorem for triangular arrays (see Section 2.8 of van der Vaart, 1998). We examine the mean and variance of a_N^{-1} ℓ̇(θ∗). For the mean, we have
E[{y_i − p(x_i; θ∗)}ġ(x_i; θ∗)] = E[E{y_i − p(x_i; θ∗) | x_i} ġ(x_i; θ∗)] = 0,
and for the variance, the dominated convergence theorem gives
\[
V\{a_N^{-1}\dot\ell(\theta^*)\} = e^{-\alpha^*}E\big[p(x;\theta^*)\{1-p(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big] \to E\{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\} = M_f.
\]
For the Lindeberg-Feller condition, for any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\big[\|\{y_i - p(x_i;\theta^*)\}\dot g(x_i;\theta^*)\|^2 I(\|\{y_i - p(x_i;\theta^*)\}\dot g(x_i;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= N E\big[\|\{y - p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\{y - p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= N E\big[p(x;\theta^*)\{1-p(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^2 I(\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]

where
\[
M_w = E\left[\left\{1 + \frac{c\,e^{f(x;\beta^*)}}{\varphi(x)}\right\}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\right].
\]
We then point out that under Assumptions 1 and 2, the condition $E[\{\varphi(x) + \varphi^{-1}(x)\}B^2(x)] < \infty$ implies the following:
\[
E\big[\varphi(x)e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2\big] < \infty, \tag{S.6}
\]
\[
E\big[\{1 + \varphi^{-1}(x)\}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\big] < \infty, \tag{S.7}
\]
\[
E\big[\varphi^{-1}(x)e^{2f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\big] < \infty. \tag{S.8}
\]
\[
\gamma_w(u) = \frac{1}{a_N}u^{\mathrm{T}}\dot\ell_w(\theta^*) - \frac{1}{2a_N^2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*)\{u^{\mathrm{T}}\dot g(x_i;\theta^*)\}^2 + \Delta_{\ell_w} + R_w,
\]
where
\[
\dot\ell_w(\theta) = \frac{\partial\ell_w(\theta)}{\partial\theta} = \sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\{y_i - p(x_i;\theta)\}\dot g(x_i;\theta),
\]
\[
\Delta_{\ell_w} = \frac{1}{2a_N^2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\{y_i - p(x_i;\theta^*)\}u^{\mathrm{T}}\ddot g(x_i;\theta^*)u,
\]
and $R_w$ is the remainder. Similarly to the proof of Theorem 1, we only need to show that
\[
a_N^{-1}\dot\ell_w(\theta^*) \longrightarrow N(0, M_w) \tag{S.9}
\]
in distribution, and
\[
a_N^{-2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \longrightarrow M_f \tag{S.10}
\]
in probability.
Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998). Denote $\delta = y + (1-y)I\{u \le \rho\varphi(x)\}$, where $u \sim U(0,1)$. For any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\{\|\eta_i\|^2 I(\|\eta_i\| > \varepsilon a_N)\}
= N E\big[\|\pi^{-1}(x,y)\delta\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\pi^{-1}(x,y)\delta\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= \rho N E\big[\varphi(x)\|\pi^{-1}(x,y)\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\pi^{-1}(x,y)\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-\rho\varphi(x)\}\|\pi^{-1}(x,y)y\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\pi^{-1}(x,y)y\{y-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
= \rho N E\big[\varphi(x)p(x;\theta^*)\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-p(x;\theta^*)\}\rho^{-1}\varphi^{-1}(x)\|p(x;\theta^*)\dot g(x;\theta^*)\|^2 I(\rho^{-1}\varphi^{-1}(x)p(x;\theta^*)\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-\rho\varphi(x)\}p(x;\theta^*)\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\|\{1-p(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\le \rho N e^{\alpha^*}E\big[\varphi(x)e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N e^{2\alpha^*}\rho^{-1}E\big[\varphi^{-1}(x)\|e^{f(x;\beta^*)}\dot g(x;\theta^*)\|^2 I(\rho^{-1}\varphi^{-1}(x)p(x;\theta^*)\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N e^{\alpha^*}E\big[e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
= o(N e^{\alpha^*}),
\]
where the last step is due to Assumption 1, (S.6)-(S.8), the dominated convergence theorem, and the facts that $a_N \to \infty$ and $\lim_{N\to\infty} e^{\alpha^*}/\rho = c < \infty$. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (S.9).
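As a concrete illustration of the scheme analyzed above, the indicator $\delta = y + (1-y)I\{u \le \rho\varphi(x)\}$ and the associated inverse probability weights $\delta/\pi(x,y)$ can be sketched in a few lines. This is a minimal NumPy sketch with a uniform placeholder $\varphi$, not the paper's production pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def negative_sample(y, phi, rho):
    """Keep every positive; keep negative i with probability rho * phi[i].

    Returns the inclusion indicator delta = y + (1 - y) * I{u <= rho * phi}
    and the IPW weights delta / pi(x, y), where pi(x, 1) = 1 and
    pi(x, 0) = rho * phi(x).
    """
    u = rng.uniform(size=y.shape)
    pi = np.where(y == 1, 1.0, rho * phi)
    delta = np.where(y == 1, 1, (u <= rho * phi).astype(int))
    return delta, delta / pi

# toy imbalanced data: roughly 1% positives, uniform phi as a placeholder
y = (rng.uniform(size=100_000) < 0.01).astype(int)
phi = np.ones(y.shape)
delta, w = negative_sample(y, phi, rho=0.05)

assert delta[y == 1].all()          # all positives are retained
# the weighted negative count is approximately the full negative count
print((y == 0).sum(), w[y == 0].sum())
```

The weights make the subsampled score function unbiased for the full-data one, which is exactly what the mean-zero property of $\eta_i$ above encodes.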
Now we prove (S.10). Let
\[
H_w \equiv a_N^{-2}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*)
= \frac{1}{N}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\,\frac{e^{f(x_i;\beta^*)}}{(1+e^{\alpha^*+f(x_i;\beta^*)})^2}\dot g^{\otimes 2}(x_i;\theta^*)
\]
\[
= \frac{1}{N}\sum_{i=1}^N \frac{y_i+(1-y_i)I\{u_i \le \rho\varphi(x_i)\}}{y_i+(1-y_i)\rho\varphi(x_i)}\,\frac{e^{f(x_i;\beta^*)}}{(1+e^{\alpha^*+f(x_i;\beta^*)})^2}\dot g^{\otimes 2}(x_i;\theta^*).
\]
We notice that
\[
E(H_w) = E\left[\frac{e^{f(x;\beta^*)}}{(1+e^{\alpha^*+f(x;\beta^*)})^2}\dot g^{\otimes 2}(x;\theta^*)\right] = E\{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\} + o(1), \tag{S.11}
\]
where the last step is by the dominated convergence theorem. In addition, the variance of each component of $H_w$ is bounded by
\[
\frac{1}{N}E\left\{\frac{\delta}{\pi^2(x,y)}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\right\}
\le \frac{1}{N}E\left\{\frac{1}{\pi(x,y)}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\right\}
\le \frac{1}{N\rho}E\left[\left\{\rho+\frac{1}{\varphi(x)}\right\}e^{2f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\right] = o(1), \tag{S.12}
\]
where the last step is because of (S.7) and the facts that $Ne^{\alpha^*} \to \infty$ and $e^{\alpha^*}/\rho \to c < \infty$ imply that $N\rho \to \infty$. From (S.11) and (S.12), Chebyshev's inequality implies that $H_w \to E\{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\}$ in probability.
In the following we finish the proof by showing that $\Delta_{\ell_w} = o_P(1)$ and $R_w = o_P(1)$. For $\Delta_{\ell_w}$, the mean $E(\Delta_{\ell_w}) = 0$ and the variance satisfies
\[
V(\Delta_{\ell_w}) \le \frac{\|u\|^4 N}{4a_N^4}E\left[\frac{\{y-p(x;\theta^*)\}^2\|\ddot g(x;\theta^*)\|^2}{y+(1-y)\rho\varphi(x)}\right]
\le \frac{\|u\|^4 N}{4a_N^4}E\left[\left\{1+\frac{p(x;\theta^*)}{\rho\varphi(x)}\right\}p(x;\theta^*)\|\ddot g(x;\theta^*)\|^2\right]
\]
\[
\le \frac{\|u\|^4}{4a_N^2}E\left[\left\{1+\frac{e^{\alpha^*}e^{f(x;\beta^*)}}{\rho\varphi(x)}\right\}e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\right] = o(1),
\]
so $\Delta_{\ell_w} = o_P(1)$. For the remainder term $R_w$, by direct calculation,
\[
R_w = \frac{1}{6a_N^3}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*+a_N^{-1}\acute{u})\{1-2p(x_i;\theta^*+a_N^{-1}\acute{u})\}\{u^{\mathrm{T}}\dot g(x_i;\theta^*+a_N^{-1}\acute{u})\}^3
\]
\[
\quad + \frac{1}{6a_N^3}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\phi(x_i;\theta^*+a_N^{-1}\acute{u})\{u^{\mathrm{T}}\dot g(x_i;\theta^*+a_N^{-1}\acute{u})\}\{u^{\mathrm{T}}\ddot g(x_i;\theta^*+a_N^{-1}\acute{u})u\}
\]
\[
\quad + \frac{1}{6a_N^3}\sum_{i=1}^N \frac{\delta_i}{\pi(x_i,y_i)}\{y_i-p(x_i;\theta^*+a_N^{-1}\acute{u})\}\sum_{j_1,j_2=1}^d u_{j_1}u_{j_2}\,\dddot g_{j_1j_2}^{\mathrm{T}}(x_i;\theta^*+a_N^{-1}\acute{u})u.
\]
which is non-random. We notice that
\[
E\left\{\frac{\zeta^2 t^2(x;\theta^*)}{\varphi(x)}\right\}
= E\left\{\frac{\zeta^2[\min\{t(x;\theta^*),T\}]^2}{\varphi(x)}\right\} + E\left[\frac{\zeta^2\{t^2(x;\theta^*)-T^2\}}{\varphi(x)}I\{t(x;\theta^*)>T\}\right]
\]
\[
= E\left[\left\{\sqrt{\varphi(x)} - \frac{\zeta\min\{t(x;\theta^*),T\}}{\sqrt{\varphi(x)}}\right\}^2\right] - E\{\varphi(x)\} + 2E\big[\zeta\min\{t(x;\theta^*),T\}\big]
+ E\left[\frac{\zeta^2\{t^2(x;\theta^*)-T^2\}}{\varphi(x)}I\{t(x;\theta^*)>T\}\right]
\]
\[
= E\left[\left\{\sqrt{\varphi(x)} - \frac{\zeta\min\{t(x;\theta^*),T\}}{\sqrt{\varphi(x)}}\right\}^2\right] + E\left[\frac{\zeta^2\{t^2(x;\theta^*)-T^2\}}{\varphi(x)}I\{t(x;\theta^*)>T\}\right] + 1
\equiv E_1 + E_2 + 1,
\]
where $I(\cdot)$ is the indicator function. Note that the $E_i$'s are non-negative and $E_1 = 0$ if and only if
\[
\varphi(x) = \zeta\min\{t(x;\theta^*),T\} = \frac{\min\{t(x;\theta^*),T\}}{E[\min\{t(x;\theta^*),T\}]},
\]
which is attainable because this choice of $\varphi(x)$ satisfies the constraint in (S.13). If $t(x;\theta^*) \le T$ almost surely, then $E_2 = 0$ and the proof finishes. If $\Pr\{t(x;\theta^*) > T\} > 0$, then
\[
\frac{\rho T}{E[\min\{t(x;\theta^*),T\}]} = 1,
\]
because if this were not true, then we could find a larger $T$ that satisfies (S.13). This means that if $t(x;\theta^*) > T$, then
\[
\varphi(x) = \frac{\min\{t(x;\theta^*),T\}}{E[\min\{t(x;\theta^*),T\}]} = \frac{T}{E[\min\{t(x;\theta^*),T\}]} = \rho^{-1},
\]
which also minimizes $E_2$, since $\varphi(x)$ appears in the denominator of $E_2$ and $\rho^{-1}$ is the largest admissible value of $\varphi(x)$. This finishes the proof.
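The truncated optimal sampling function $\varphi(x) = \min\{t(x;\theta^*),T\}/E[\min\{t(x;\theta^*),T\}]$ from the argument above can be approximated on a sample of criterion values $t$. This is a sketch under the assumption that the $t$ values are available (e.g. from a pilot estimate); the grid search over sorted sample values is an illustration, not the paper's exact algorithm:

```python
import numpy as np

def truncated_optimal_phi(t, rho):
    """phi = min(t, T) / mean(min(t, T)) with the largest T such that
    rho * T / mean(min(t, T)) <= 1, which guarantees rho * phi <= 1."""
    # rho * T / mean(min(t, T)) is nondecreasing in T, so scan the sorted
    # sample values from the largest down to the first feasible threshold.
    T = np.inf
    for cand in np.sort(t)[::-1]:
        if rho * cand / np.minimum(t, cand).mean() <= 1:
            T = cand
            break
    m = np.minimum(t, T)
    return m / m.mean()

t = np.random.default_rng(1).lognormal(size=10_000)
rho = 0.1
phi = truncated_optimal_phi(t, rho)
assert (rho * phi <= 1 + 1e-9).all()   # valid sampling probabilities
assert abs(phi.mean() - 1.0) < 1e-9    # normalization E[phi] = 1
```

Truncation at $T$ is exactly what caps the sampling probability $\rho\varphi(x)$ at one while keeping $\varphi$ proportional to $\min\{t, T\}$.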
\[
\Pr(y=1 \mid x, \delta=1)
= \frac{\Pr(y=1, \delta=1 \mid x)}{\Pr(\delta=1 \mid x)}
= \frac{\Pr(y=1, \delta=1 \mid x)}{\Pr(y=1, \delta=1 \mid x) + \Pr(y=0, \delta=1 \mid x)}
\]
\[
= \frac{\Pr(y=1 \mid x)\Pr(\delta=1 \mid x, y=1)}{\Pr(y=1 \mid x)\Pr(\delta=1 \mid x, y=1) + \Pr(y=0 \mid x)\Pr(\delta=1 \mid x, y=0)}
\]
\[
= \frac{p(x;\theta^*)\pi(x,1)}{p(x;\theta^*)\pi(x,1) + \{1-p(x;\theta^*)\}\pi(x,0)}
= \frac{e^{g(x;\theta^*)}}{e^{g(x;\theta^*)} + \pi(x,0)}
= \frac{e^{g(x;\theta^*)+l}}{1 + e^{g(x;\theta^*)+l}},
\]
where $l = -\log\{\pi(x,0)\}$.
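In implementation terms, the derivation above says that the subsampled data follow a logistic model whose linear predictor is shifted by the offset $l = -\log \pi(x,0)$; a minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def corrected_probability(g, pi0):
    """Log odds correction: Pr(y = 1 | x, delta = 1) on the sampled data.

    g   : g(x; theta), the log odds under the full-data model
    pi0 : pi(x, 0) = rho * phi(x), the sampling probability for negatives
    Positives are kept with probability 1, so the sampled-data log odds
    equal g + l with offset l = -log(pi0).
    """
    z = g - np.log(pi0)
    return 1.0 / (1.0 + np.exp(-z))

# sanity check against the ratio form p*pi1 / (p*pi1 + (1 - p)*pi0)
g = np.array([-3.0, -1.0, 0.5])
pi0 = np.array([0.02, 0.1, 0.5])
p = 1 / (1 + np.exp(-g))
direct = p / (p + (1 - p) * pi0)
assert np.allclose(corrected_probability(g, pi0), direct)
```

The offset form is convenient in practice because it lets standard logistic regression software fit the sampled data unchanged, with $l$ supplied per instance.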
S.5 Proof of Theorem 4
The estimator $\hat\theta_{lik}$ is the maximizer of $\ell_{lik}(\theta)$, so $u_N = a_N(\hat\theta_{lik} - \theta^*)$ is the maximizer of
\[
\gamma_{lik}(u) = \ell_{lik}(\theta^* + a_N^{-1}u) - \ell_{lik}(\theta^*).
\]
By Taylor's expansion,
\[
\gamma_{lik}(u) = a_N^{-1}u^{\mathrm{T}}\dot\ell_{lik}(\theta^*) - \frac{1}{2}a_N^{-2}\sum_{i=1}^N \delta_i\phi_\pi(x_i;\theta^*)\{u^{\mathrm{T}}\dot g(x_i;\theta^*)\}^2 + \Delta_{\ell_{lik}} + R_{lik},
\]
\[
\le e^{\alpha^*}\rho^{-1}E\{e^{f(x;\beta^*)}\varphi^{-1}(x)B(x)\} + E\{B(x)\}.
\]
Therefore, $E(\Delta_{R_{lik}1}) \to 0$. Since $\Delta_{R_{lik}1}$ is positive, Markov's inequality shows that $\Delta_{R_{lik}1} = o_P(1)$ and thus $R_{lik} = o_P(1)$.
For $\Delta_{\ell_{lik}}$, the mean $E(\Delta_{\ell_{lik}}) = 0$ and the variance satisfies
\[
V(\Delta_{\ell_{lik}}) \le \frac{\|u\|^4}{4a_N^4}\sum_{i=1}^N E\{\delta_i p_\pi(x_i;\theta^*)\|\ddot g(x_i;\theta^*)\|^2\} \le \frac{\|u\|^4}{4a_N^2}E[e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2] \to 0,
\]
where the last step is because Assumption 1 implies that $e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2$ is integrable, so $\Delta_{\ell_{lik}} = o_P(1)$.
If we can show that
\[
a_N^{-1}\dot\ell_{lik}(\theta^*) \longrightarrow N(0, \Lambda_{lik}) \tag{S.15}
\]
in distribution, and
\[
a_N^{-2}\sum_{i=1}^N \delta_i\phi_\pi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \longrightarrow \Lambda_{lik} \tag{S.16}
\]
in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that $a_N(\hat\theta_{lik} - \theta^*)$, the maximizer of $\gamma_{lik}(u)$, satisfies
\[
a_N(\hat\theta_{lik} - \theta^*) = \Lambda_{lik}^{-1} \times a_N^{-1}\dot\ell_{lik}(\theta^*) + o_P(1). \tag{S.17}
\]
Slutsky's theorem together with (S.15) and (S.17) implies the result in Theorem 4. We prove (S.15) and (S.16) in the following.
Note that $\dot\ell_{lik}(\theta^*)$ is a summation of i.i.d. quantities $\delta_i\{y_i - p_\pi(x_i;\theta^*)\}\dot g(x_i;\theta^*)$, whose distribution depends on $N$. Note that
\[
E[\delta\{y - p_\pi(x;\theta^*)\}\dot g(x;\theta^*)]
= E[\pi(x,y)\{y - p_\pi(x;\theta^*)\}\dot g(x;\theta^*)]
= E\big[\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\} - \{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi(x;\theta^*)\big]\dot g(x;\theta^*)\big] = 0, \tag{S.18}
\]
where the last step is by a direct calculation from (S.14), which yields the equality
\[
p(x;\theta^*)\{1-p_\pi(x;\theta^*)\} = \{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi(x;\theta^*). \tag{S.19}
\]
\[
V\{a_N^{-1}\dot\ell_{lik}(\theta^*)\}
= a_N^{-2}N\,E\big[\delta\{y - p_\pi(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big]
= e^{-\alpha^*}E\big[\{y+(1-y)\rho\varphi(x)\}\{y - p_\pi(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= e^{-\alpha^*}E\big[\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2 + \{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi^2(x;\theta^*)\big]\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.20}
\]
\[
= e^{-\alpha^*}E\big[\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2 + p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}p_\pi(x;\theta^*)\big]\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.21}
\]
\[
= e^{-\alpha^*}E\big[\big[\{1-p_\pi(x;\theta^*)\}^2 + \{1-p_\pi(x;\theta^*)\}p_\pi(x;\theta^*)\big]p(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.22}
\]
\[
= e^{-\alpha^*}E\big[\big[\{1-p_\pi(x;\theta^*)\}^2 + \phi_\pi(x;\theta^*)\big]p(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big] \tag{S.23}
\]
\[
= E\big[\big[\{1-p_\pi(x;\theta^*)\}^2 + \phi_\pi(x;\theta^*)\big]\{1-p(x;\theta^*)\}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\big], \tag{S.24}
\]
where the fourth equality uses (S.19). Note that the elements of the term inside the expectation in (S.24) are all bounded by the integrable random variable $e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2$. Thus, from the dominated convergence theorem and the fact that
\[
\big[\{1-p_\pi(x;\theta^*)\}^2 + \phi_\pi(x;\theta^*)\big]\{1-p(x;\theta^*)\}
= 1 - p_\pi(x;\theta^*) + o_P(1) = \frac{1}{1 + c\varphi^{-1}(x)e^{f(x;\beta^*)}} + o_P(1),
\]
we have
\[
V\{a_N^{-1}\dot\ell_{lik}(\theta^*)\} \longrightarrow E\left[\frac{e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)}{1 + c\varphi^{-1}(x)e^{f(x;\beta^*)}}\right] = \Lambda_{lik}. \tag{S.25}
\]
To verify the Lindeberg-Feller condition, note that for any $\varepsilon > 0$,
\[
N E\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^2 I(\{1-p_\pi(x;\theta^*)\}\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\quad + N E\big[\{1-p(x;\theta^*)\}\rho\varphi(x)\{p_\pi(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^2 I(\rho\varphi(x)p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\le N E\big[e^{\alpha^*+f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
+ N E\big[e^{\alpha^*+f(x;\beta^*)}p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\|^2 I(e^{\alpha^*+f(x;\beta^*)}\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
\]
\[
\le a_N^2 E\big[e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
+ a_N^2 E\big[e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2 I(e^{\alpha^*+f(x;\beta^*)}\|\dot g(x;\theta^*)\| > \varepsilon a_N)\big]
= o(a_N^2),
\]
where the first inequality uses the following fact derived from (S.14):
\[
\rho\varphi(x)p_\pi(x;\theta^*) = e^{\alpha^*+f(x;\beta^*)}\{1-p_\pi(x;\theta^*)\} \le e^{\alpha^*+f(x;\beta^*)}, \tag{S.26}
\]
and the last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998), we finish the proof of (S.15).
Now we prove (S.16). Denote
\[
H_{lik} = a_N^{-2}\sum_{i=1}^N \delta_i\phi_\pi(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*).
\]
We notice that
\[
E(H_{lik}) = N a_N^{-2}E\big[\{y+(1-y)\rho\varphi(x)\}\phi_\pi(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= e^{-\alpha^*}E\big[\{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi(x;\theta^*)\{1-p_\pi(x;\theta^*)\}\dot g^{\otimes 2}(x;\theta^*)\big]
+ e^{-\alpha^*}E\big[p(x;\theta^*)p_\pi(x;\theta^*)\{1-p_\pi(x;\theta^*)\}\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= e^{-\alpha^*}E\big[p(x;\theta^*)\{1-p_\pi(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*)\big]
+ e^{-\alpha^*}E\big[\{1-p(x;\theta^*)\}\rho\varphi(x)p_\pi^2(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*)\big]
\]
\[
= V\{a_N^{-1}\dot\ell_{lik}(\theta^*)\} = \Lambda_{lik} + o(1), \tag{S.27}
\]
where the third equality is obtained by applying (S.19) and the fourth equality is from (S.20). In addition, the variance of each component of $H_{lik}$ is bounded by
\[
a_N^{-4}N\,E\{\delta\phi_\pi^2(x;\theta^*)\|\dot g(x;\theta^*)\|^4\}
\le a_N^{-4}N\,E\big[\{y+(1-y)\rho\varphi(x)\}p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\|^4\big]
\]
\[
\le a_N^{-4}N\,E\big[\{p(x;\theta^*)+\rho\varphi(x)\}p_\pi(x;\theta^*)\|\dot g(x;\theta^*)\|^4\big]
\le 2a_N^{-4}N e^{\alpha^*}E\{e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\} = o(1), \tag{S.28}
\]
where the last inequality is because of (S.26), and the last step is because Assumption 1 implies that $e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4$ is integrable. From (S.27) and (S.28), Chebyshev's inequality implies that $H_{lik} \to \Lambda_{lik}$ in probability.
S.6 Proof of Theorem 5
Let $\kappa(x) = 1 + c\varphi^{-1}(x)e^{f(x;\beta^*)}$. We first notice that
\[
[E\{e^{f(x;\beta^*)}\}]^{-1}V_w = M_f^{-1}(M_f + c\Lambda_{sub})M_f^{-1} = M_f^{-1}E\big[\kappa(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\big]M_f^{-1},
\]
and
\[
[E\{e^{f(x;\beta^*)}\}]^{-1}V_{lik} = \Lambda_{lik}^{-1} = \big(E\big[\kappa^{-1}(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\big]\big)^{-1}.
\]
We only need to show that
\[
[E\{e^{f(x;\beta^*)}\}]^{-1}V_w \ge \Lambda_{lik}^{-1}.
\]
This is proved by the following calculation:
\[
0 \le E\Big[\big[\{\kappa^{1/2}(x)M_f^{-1} - \kappa^{-1/2}(x)\Lambda_{lik}^{-1}\}e^{f(x;\beta^*)/2}\dot g(x;\theta^*)\big]^{\otimes 2}\Big]
\]
\[
= E\big\{M_f^{-1}\kappa(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)M_f^{-1}\big\} + E\big\{\Lambda_{lik}^{-1}\kappa^{-1}(x)e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\Lambda_{lik}^{-1}\big\}
- 2E\big\{M_f^{-1}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*)\Lambda_{lik}^{-1}\big\}
\]
\[
= [E\{e^{f(x;\beta^*)}\}]^{-1}V_w + \Lambda_{lik}^{-1} - 2\Lambda_{lik}^{-1} = [E\{e^{f(x;\beta^*)}\}]^{-1}V_w - \Lambda_{lik}^{-1}.
\]
where $dy = dy_1 \times dy_2 \times \cdots \times dy_N$ is a product measure and each $dy_i$ is the counting measure. This gives us
\[
E\{U(\theta;\mathcal{D}_\delta)\dot\ell_{lik}^{\mathrm{T}}(\theta) \mid X\} = I. \tag{S.30}
\]
Hence we have
\[
0 \le V\big\{U(\theta;\mathcal{D}_\delta) - a_N^{-2}\Lambda_{lik}^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta) \mid X\} + V\big\{a_N^{-2}\Lambda_{lik}^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\} - 2a_N^{-2}E\{U(\theta;\mathcal{D}_\delta)\dot\ell_{lik}^{\mathrm{T}}(\theta) \mid X\}\Lambda_{lik}^{-1}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta) \mid X\} + a_N^{-2}\Lambda_{lik}^{-1}V\big\{a_N^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\}\Lambda_{lik}^{-1} - 2a_N^{-2}\Lambda_{lik}^{-1}.
\]
Taking the expectation of the above conditional variance and using (S.25), we have
\[
E[V\{U(\theta;\mathcal{D}_\delta) \mid X\}] + a_N^{-2}\Lambda_{lik}^{-1}E\big[V\big\{a_N^{-1}\dot\ell_{lik}(\theta) \,\big|\, X\big\}\big]\Lambda_{lik}^{-1} - 2a_N^{-2}\Lambda_{lik}^{-1}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta)\} + a_N^{-2}\Lambda_{lik}^{-1}\{1+o_P(1)\} - 2a_N^{-2}\Lambda_{lik}^{-1}
= V\{U(\theta;\mathcal{D}_\delta)\} - a_N^{-2}\Lambda_{lik}^{-1}\{1+o_P(1)\}
\]
\[
= V\{U(\theta;\mathcal{D}_\delta)\} - N_1^{-1}V_{lik}\{1+o_P(1)\} \ge 0,
\]
where the last equality is because (S.1) implies that $a_N^{-2} = N_1^{-1}E\{e^{f(x;\beta^*)}\}\{1+o_P(1)\}$. This finishes the proof.
We prove (S.31) first. Let $\eta_i^{\tilde\vartheta} = \frac{\delta_i^{\tilde\vartheta}}{\pi_\varrho^{os}(x_i,y_i;\tilde\vartheta)}\{y_i - p(x_i;\theta^*)\}\dot g(x_i;\theta^*)$. Given $\tilde\vartheta$, the $\eta_i^{\tilde\vartheta}$, $i = 1, \ldots, N$, are i.i.d. with mean zero, and the variance satisfies
\[
V(\eta_i^{\tilde\vartheta} \mid \tilde\vartheta)
= E\left[\frac{\{y_i - p(x_i;\theta^*)\}^2}{y_i + (1-y_i)\pi_\varrho^{os}(x_i;\tilde\vartheta)}\dot g^{\otimes 2}(x_i;\theta^*) \,\Big|\, \tilde\vartheta\right]
\]
\[
= E\left[p(x_i;\theta^*)\{1-p(x_i;\theta^*)\}^2\dot g^{\otimes 2}(x_i;\theta^*) + \{1-p(x_i;\theta^*)\}\frac{p^2(x_i;\theta^*)}{\pi_\varrho^{os}(x_i;\tilde\vartheta)}\dot g^{\otimes 2}(x_i;\theta^*) \,\Big|\, \tilde\vartheta\right]
\]
\[
\le E\big[p(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) + \varrho^{-1}p^2(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \,\big|\, \tilde\vartheta\big]
\le e^{\alpha^*}E\big[e^{f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*) + (\varrho^{-1}e^{\alpha^*})e^{2f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*)\big].
\]
Since $\rho^{-1}\varrho \to c_l > 0$, for large enough $N$,
\[
V\{a_N^{-1}\dot\ell_w^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\} \le E\big[e^{f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*) + (1 + c_l^{-1}c)e^{2f(x_i;\beta^*)}\dot g^{\otimes 2}(x_i;\theta^*)\big].
\]
Thus, we know that
\[
V\{a_N^{-1}\dot\ell_w^{\tilde\vartheta}(\theta^*)\} \longrightarrow \Lambda_w^{plt}.
\]
Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998) given $\tilde\vartheta$. For any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\big\{\|\eta_i^{\tilde\vartheta}\|^2 I(\|\eta_i^{\tilde\vartheta}\| > \varepsilon a_N) \,\big|\, \tilde\vartheta\big\} = o(a_N^2),
\]
by arguments similar to those used to verify the Lindeberg-Feller condition in the proof of Theorem 2.
For $\Delta_{\ell_w}^{\tilde\vartheta}$, the conditional mean $E(\Delta_{\ell_w}^{\tilde\vartheta} \mid \tilde\vartheta) = 0$ and the conditional variance satisfies
\[
V(\Delta_{\ell_w}^{\tilde\vartheta} \mid \tilde\vartheta) \le \frac{\|u\|^4 N}{4a_N^4}E\big[\{1+\varrho^{-1}p(x;\theta^*)\}p(x;\theta^*)\|\ddot g(x;\theta^*)\|^2\big]
\le \frac{\|u\|^4}{4a_N^2}E\big[\{1+\varrho^{-1}e^{\alpha^*}e^{f(x;\beta^*)}\}e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\big] = o(1),
\]
so $\Delta_{\ell_w}^{\tilde\vartheta} = o_P(1)$.
For the remainder term $R_w^{\tilde\vartheta}$, by direct calculations similar to those used in the proof of Theorem 2, we know that
\[
|R_w^{\tilde\vartheta}| \le \frac{d^3\|u\|^3 e^{a_N^{-1}\|u\|}}{6Na_N}\sum_{i=1}^N \frac{\delta_i^{\tilde\vartheta}}{\pi_\varrho^{os}(x_i,y_i;\tilde\vartheta)}B(x_i) + \frac{d\|u\|^3}{6a_N^3}\sum_{i=1}^N \frac{\delta_i^{\tilde\vartheta}}{\pi_\varrho^{os}(x_i,y_i;\tilde\vartheta)}y_iB(x_i)
\equiv \Delta_{R_w1}^{\tilde\vartheta} + \Delta_{R_w2}^{\tilde\vartheta}.
\]
By Taylor's expansion,
\[
\gamma_{lik}^{\tilde\vartheta}(u) = a_N^{-1}u^{\mathrm{T}}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) - \frac{1}{2}a_N^{-2}\sum_{i=1}^N \delta_i^{\tilde\vartheta}\phi_\pi^{\tilde\vartheta}(x_i;\theta^*)\{u^{\mathrm{T}}\dot g(x_i;\theta^*)\}^2 + \Delta_{\ell_{lik}}^{\tilde\vartheta} + R_{lik}^{\tilde\vartheta}, \tag{S.35}
\]
where $\phi_\pi^{\tilde\vartheta}(x;\theta) = p_\pi^{\tilde\vartheta}(x;\theta)\{1-p_\pi^{\tilde\vartheta}(x;\theta)\}$,
\[
p_\pi^{\tilde\vartheta}(x;\theta) = \frac{e^{g(x;\theta)+\tilde l}}{1+e^{g(x;\theta)+\tilde l}}, \quad \text{with } \tilde l = -\log\{\pi_\varrho^{os}(x;\tilde\vartheta)\}, \tag{S.36}
\]
\[
\dot\ell_{lik}^{\tilde\vartheta}(\theta) = \sum_{i=1}^N \delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta)\}\dot g(x_i;\theta),
\]
\[
\Delta_{\ell_{lik}}^{\tilde\vartheta} = \frac{1}{2a_N^2}\sum_{i=1}^N \delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}u^{\mathrm{T}}\ddot g(x_i;\theta^*)u, \tag{S.37}
\]
and $R_{lik}^{\tilde\vartheta}$ is the remainder term. By direct calculations similar to those used in the proof of Theorem 4, we know that
\[
|R_{lik}^{\tilde\vartheta}| \le \frac{\|u\|^3}{6a_N^3}\sum_{i=1}^N \delta_i^{\tilde\vartheta}C(x_i,\theta^*+a_N^{-1}\acute{u}) + \frac{d\|u\|^3}{6a_N^3}\sum_{i=1}^N \delta_i^{\tilde\vartheta}y_iB(x_i)
\le \frac{(d+1)\|u\|^3}{6a_N^3}\sum_{i=1}^N \delta_i^{\tilde\vartheta}B(x_i),
\]
f (x;β∗ ) −α∗
≤ a−1 B(x) + a−1 −1
ρ)(eα̃ ω̃)−1 E ef (x;θ̃) kġ(x; θ̃)kB(x) ϑ̃
N E e N λ̃min (e
f (x;β∗ ) ∗
≤ a−1 −1 −1
λ̃min (e−α ρ)(eα̃ ω̃)−1 E{B 2 (x)} = oP (1),
N E e B(x) + aN
ϑ̃
where λ̃min is the minimum eigenvalue of M̃f . Thus, Markov’s inequality shows that Rlik = oP (1).
For $\Delta_{\ell_{lik}}^{\tilde\vartheta}$, the conditional mean $E(\Delta_{\ell_{lik}}^{\tilde\vartheta} \mid \tilde\vartheta) = 0$ and the conditional variance satisfies
\[
V(\Delta_{\ell_{lik}}^{\tilde\vartheta} \mid \tilde\vartheta) \le \frac{\|u\|^4}{4a_N^4}\sum_{i=1}^N E\big\{\delta_i^{\tilde\vartheta}p_\pi^{\tilde\vartheta}(x_i;\theta^*)\|\ddot g(x_i;\theta^*)\|^2 \,\big|\, \tilde\vartheta\big\}
\le \frac{\|u\|^4}{4a_N^2}E\big[e^{f(x;\beta^*)}\|\ddot g(x;\theta^*)\|^2\big] \to 0,
\]
so $\Delta_{\ell_{lik}}^{\tilde\vartheta} = o_P(1)$.
If we can show that
\[
a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \longrightarrow N(0, \Lambda_{lik}^{plt}) \tag{S.38}
\]
in distribution, and
\[
H_{lik}^{\tilde\vartheta} := a_N^{-2}\sum_{i=1}^N \delta_i^{\tilde\vartheta}\phi_\pi^{\tilde\vartheta}(x_i;\theta^*)\dot g^{\otimes 2}(x_i;\theta^*) \longrightarrow \Lambda_{lik}^{plt} \tag{S.39}
\]
in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that $a_N(\hat\theta_{lik}^{\tilde\vartheta} - \theta^*)$, the maximizer of $\gamma_{lik}^{\tilde\vartheta}(u)$, satisfies
\[
a_N(\hat\theta_{lik}^{\tilde\vartheta} - \theta^*) = (\Lambda_{lik}^{plt})^{-1} \times a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) + o_P(1). \tag{S.40}
\]
Slutsky's theorem together with (S.38) and (S.40) implies the result of the theorem. We prove (S.38) and (S.39) in the following.
Given $\tilde\vartheta$, $\dot\ell_{lik}^{\tilde\vartheta}(\theta^*)$ is a summation of i.i.d. quantities $\delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}\dot g(x_i;\theta^*)$. Using an approach similar to that used to obtain (S.18), we know that $E\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\} = 0$. The conditional variance of $a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*)$ satisfies
\[
V\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\}
= e^{-\alpha^*}E\big[\{y+(1-y)\pi_\varrho^{os}(x;\tilde\vartheta)\}\{y - p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big]
\]
\[
= e^{-\alpha^*}E\big[\big[p(x;\theta^*)\{1-p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2 + \pi_\varrho^{os}(x;\tilde\vartheta)\{1-p(x;\theta^*)\}\{p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2\big]\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big]
\]
\[
= e^{-\alpha^*}E\big[\big[\{1-p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2 + \phi_\pi^{\tilde\vartheta}(x;\theta^*)\big]p(x;\theta^*)\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big]
\]
\[
= E\big[\big[\{1-p_\pi^{\tilde\vartheta}(x;\theta^*)\}^2 + \phi_\pi^{\tilde\vartheta}(x;\theta^*)\big]\{1-p(x;\theta^*)\}e^{f(x;\beta^*)}\dot g^{\otimes 2}(x;\theta^*) \,\big|\, \tilde\vartheta\big].
\]
The elements of the term on the right-hand side of the above equation are bounded by the integrable random variable $e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^2$, so if $\varrho = o(\rho)$ then
\[
V\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*)\} \longrightarrow \Lambda_{lik}^{plt}. \tag{S.41}
\]
Using a similar approach as in checking the Lindeberg-Feller condition in the proof of Theorem 4, we obtain that for any $\varepsilon > 0$,
\[
\sum_{i=1}^N E\big[\|\delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}\dot g(x_i;\theta^*)\|^2 I(\|\delta_i^{\tilde\vartheta}\{y_i - p_\pi^{\tilde\vartheta}(x_i;\theta^*)\}\dot g(x_i;\theta^*)\| > \varepsilon a_N) \,\big|\, \tilde\vartheta\big]
\]
\[
= N E\big[\delta^{\tilde\vartheta}\|\{y - p_\pi^{\tilde\vartheta}(x;\theta^*)\}\dot g(x;\theta^*)\|^2 I(\delta^{\tilde\vartheta}\|\{y - p_\pi^{\tilde\vartheta}(x;\theta^*)\}\dot g(x;\theta^*)\| > \varepsilon a_N) \,\big|\, \tilde\vartheta\big] = o(a_N^2),
\]
where the last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998), we finish the proof of (S.38).
Now we prove (S.39). By derivations similar to those in (S.27), we know that
\[
E(H_{lik}^{\tilde\vartheta} \mid \tilde\vartheta) = V\{a_N^{-1}\dot\ell_{lik}^{\tilde\vartheta}(\theta^*) \mid \tilde\vartheta\} = \Lambda_{lik}^{plt} + o_P(1), \tag{S.42}
\]
where the second equality is from (S.41). In addition, the conditional variance of each component of $H_{lik}^{\tilde\vartheta}$ is bounded by
\[
a_N^{-4}N\,E\big\{\delta^{\tilde\vartheta}\{\phi_\pi^{\tilde\vartheta}(x;\theta^*)\}^2\|\dot g(x;\theta^*)\|^4 \,\big|\, \tilde\vartheta\big\}
\le a_N^{-4}N\,E\big[\{y+(1-y)\pi_\varrho^{os}(x;\tilde\vartheta)\}p_\pi^{\tilde\vartheta}(x;\theta^*)\|\dot g(x;\theta^*)\|^4 \,\big|\, \tilde\vartheta\big]
\]
\[
\le a_N^{-4}N\,E\big[\{p(x;\theta^*)+\pi_\varrho^{os}(x;\tilde\vartheta)\}p_\pi^{\tilde\vartheta}(x;\theta^*)\|\dot g(x;\theta^*)\|^4 \,\big|\, \tilde\vartheta\big]
\le 2a_N^{-4}N e^{\alpha^*}E\{e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\} = 2a_N^{-2}E\{e^{f(x;\beta^*)}\|\dot g(x;\theta^*)\|^4\} = o(1). \tag{S.43}
\]
From (S.42) and (S.43), Chebyshev's inequality implies that $H_{lik}^{\tilde\vartheta} \to \Lambda_{lik}^{plt}$ in probability.
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.4: Degree of perturbation ξ = 0.1. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.5: Degree of perturbation ξ = 0.5. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.6: Degree of perturbation ξ = 1.0. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.7: Degree of perturbation ξ = 0.1. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.8: Degree of perturbation ξ = 0.5. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.9: Degree of perturbation ξ = 1.0. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
This section studies the case where the model is mis-specified, that is, the ground-truth model falls outside the family of linear logistic regression models. We design the ground-truth model as a two-layer neural network with a non-linear activation:
\[
g(x;\theta) = \alpha + \frac{1}{\xi}\tanh(\xi \cdot x^{\mathrm{T}}W)\beta, \tag{S.45}
\]
where $\xi$ is a constant controlling the "cut-off" threshold of the non-linear activation $\tanh$. Note that as $\xi \to 0$, the scaled tanh function $\frac{1}{\xi}\tanh(\xi \cdot x) \approx x$, as shown in Figure S.10. Thus, by tuning $\xi$ we control the deviation of the scaled tanh function from the linear identity transformation.
Figure S.10: The scaled tanh function $\frac{1}{\xi}\tanh(\xi x)$ approximates the identity transformation as $\xi \to 0$ (shown for $\xi = 1, 0.5, 0.3, 0.1$).
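To see numerically how fast $\frac{1}{\xi}\tanh(\xi x)$ approaches the identity, and that the deviation grows with $\xi$, a small sketch:

```python
import numpy as np

def scaled_tanh(x, xi):
    """(1/xi) * tanh(xi * x), the activation in the ground-truth model (S.45)."""
    return np.tanh(xi * x) / xi

x = np.linspace(-10.0, 10.0, 201)
for xi in (1.0, 0.5, 0.3, 0.1):
    dev = np.max(np.abs(scaled_tanh(x, xi) - x))
    print(f"xi={xi}: max deviation from identity = {dev:.3f}")
# the deviation from the identity shrinks monotonically as xi decreases
```

For very small $\xi$ (e.g. $10^{-3}$), the deviation over this range is negligible, which is why the results in Figure S.11 match the well-specified case.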
When $\xi$ and $\xi_w$ are extremely small, $g(x;\theta) \approx \alpha + x^{\mathrm{T}}\beta$. The empirical results shown in Figure S.11 match the results without model mis-specification.
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.11: Degree of non-linearity ξ = 10−3 , ξw = 10−3 . Ground-truth intercepts: (a) α = −7.65,
(b) α = −0.5, (c) α = −7, (d) α = −1.8. Log (MSEs) of subsample estimators for different sample
sizes (the smaller the better).
We fix $\xi_w = 10^{-3}$ and tune $\xi$ to study the effect of the non-linear activation. Note that when $\xi \to 0$, $\frac{1}{\xi}\tanh(\xi x) \approx x$, and when $\xi \to \infty$, $\frac{1}{\xi}\tanh(\xi x) \approx 0$. As we shall see, the value of $\xi$ significantly affects the non-linearity of the model. In Figure S.12, we choose a relatively small $\xi = 0.1$ and observe an almost indistinguishable loss in log(MSE). In Figure S.13, we study the case $\xi = 0.5$ and find that it significantly affects the model estimation when $x$ has long tails (Figure S.13(b), (d)). When $\xi = 1$, all methods break down, since the true (non-linear) model deviates from the tractable (linear) family. To avoid this, we suggest re-scaling and centering the data before feeding them to the model. Note that when the models do not break down, optLik still achieves the best performance among all methods.
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.12: Degree of non-linearity ξ = 10−1 , ξw = 10−3 . Ground-truth intercepts: (a) α = −7.7,
(b) α = −0.6, (c) α = −6.9, (d) α = −1.9. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.13: Degree of non-linearity ξ = 5·10−1 , ξw = 10−3 . Ground-truth intercepts: (a) α = −7.4,
(b) α = −0.95, (c) α = −6.7, (d) α = −2.15. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
Figure S.14: Degree of non-linearity ξ = 1.0, ξw = 10−3 . Ground-truth intercepts: (a) α = −7.05,
(b) α = −1.75, (c) α = −6.4, (d) α = −2.85. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
We fix $\xi = 10^{-3}$ and focus on studying the effect of the generative process of $W$ on the estimation error. In Figure S.15, we set $\xi_w = 2\cdot10^{-2}$, a relatively small value such that $\|W - I\|_{2\to2} \approx 0.20 < 1$. In this case, each sampling method suffers from model mis-specification and has higher log(MSE). Still, optLik is the best under all generative processes of $x$. In Figure S.16, we set $\xi_w = 10^{-1}$, so that $\|W - I\|_{2\to2} \approx 1$. In this case, the noise level is comparable to the ground truth, and all sampling methods break down: the estimated models deviate significantly from the ground truth.
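The text above does not spell out how $W$ is generated. One construction consistent with the reported spectral norms ($\|W - I\|_{2\to2} \approx 0.2$ at $\xi_w = 2\cdot10^{-2}$ and $\approx 1$ at $\xi_w = 10^{-1}$) is $W = I + \xi_w Z$ with i.i.d. standard normal entries in $Z$, for which $\|\xi_w Z\|_2 \approx 2\xi_w\sqrt{d}$. Both this construction and the dimension below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_weight(d, xi_w):
    """Hypothetical construction: W = I + xi_w * Z with Z having i.i.d.
    N(0, 1) entries; the spectral norm ||W - I||_{2->2} is then roughly
    2 * xi_w * sqrt(d). Not the paper's documented generative process.
    """
    return np.eye(d) + xi_w * rng.standard_normal((d, d))

d = 25  # assumed dimension, chosen so the norms match the text
for xi_w in (2e-2, 1e-1):
    W = perturbed_weight(d, xi_w)
    # spectral (2->2 operator) norm of the deviation from the identity
    print(xi_w, np.linalg.norm(W - np.eye(d), ord=2))
```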
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
−3 −2
Figure S.15: Degree of non-linearity ξ = 10 , ξw = 2·10 . Ground-truth intercepts: (a) α = −7.7,
(b) α = −0.5, (c) α = −6.9, (d) α = −1.9. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
(a). x’s are normal (b) x’s are lognormal (c) x’s are t3 (d) x’s are exponential
−3 −1
Figure S.16: Degree of non-linearity ξ = 10 , ξw = 10 . Ground-truth intercepts: (a) α = −8.25,
(b) α = −0.25, (c) α = −6.6, (d) α = −2.4. Log (MSEs) of subsample estimators for different
sample sizes (the smaller the better).
References
Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data Mining and
Knowledge Discovery Handbook, 875–886. Springer.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from
imbalanced data sets. ACM SIGKDD Explorations Newsletter 6, 1, 1–6.
Chen, J., Sun, B., Li, H., Lu, H., and Hua, X.-S. (2016). Deep CTR prediction in display advertising.
In Proceedings of the 24th ACM International Conference on Multimedia (MM), 811–820.
Cheng, G., Huang, J. Z., et al. (2010). Bootstrap consistency for general semiparametric M-estimation.
The Annals of Statistics 38, 5, 2884–2915.
Douzas, G. and Bacao, F. (2017). Self-organizing map oversampling (SOMO) for imbalanced data set
learning. Expert Systems with Applications 82, 40–52.
Drummond, C., Holte, R. C., et al. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling
beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, vol. 11, 1–8. Citeseer.
Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for learning from
imbalanced data sets. Computational Intelligence 20, 1, 18–36.
Fithian, W. and Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced
data sets. Annals of Statistics 42, 5, 1693.
Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). DeepFM: a factorization-machine based neural
network for CTR prediction. arXiv preprint arXiv:1703.04247.
Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-smote: A new over-sampling method in
imbalanced data sets learning. In D.-S. Huang, X.-P. Zhang, and G.-B. Huang, eds., Advances in
Intelligent Computing, 878–887, Berlin, Heidelberg. Springer Berlin Heidelberg.
Han, L., Tan, K. M., Yang, T., Zhang, T., et al. (2020). Local uncertainty sampling for large-scale
multiclass logistic regression. Annals of Statistics 48, 3, 1770–1788.
He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al.
(2014). Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth
International Workshop on Data Mining for Online Advertising, 1–9.
Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions.
Technometrics 37, 2, 185–194.
Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers of convex processes. arXiv preprint
arXiv:1107.3806 .
Huang, J.-T., Sharma, A., Sun, S., Xia, L., Zhang, D., Pronin, P., Padmanabhan, J., Ottaviano, G., and
Yang, L. (2020). Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining, 2553–2561.
Japkowicz, N. (2000). Learning from imbalanced data sets: Papers from the AAAI workshop, AAAI,
2000. Technical Report WS-00-05.
Kilbertus, N., Parascandolo, G., and Schölkopf, B. (2018). Generalization in anti-causal learning.
arXiv preprint arXiv:1812.00524 .
King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political Analysis 9, 2, 137–163.
Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle
the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research
18, 1, 559–563.
Liu, X., Wu, J., and Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning. IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2, 539–550.
Mathew, J., Pang, C. K., Luo, M., and Leong, W. H. (2017). Classification of imbalanced data by
oversampling in kernel space of support vector machines. IEEE Transactions on Neural Networks
and Learning Systems 29, 9, 4065–4076.
McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov,
E., Golovin, D., et al. (2013). Ad click prediction: a view from the trenches. In Proceedings
of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
1222–1230.
Owen, A. and Zhou, Y. (2000). Safe and effective importance sampling. Journal of the American
Statistical Association 95, 449, 135–143.
Owen, A. B. (2007). Infinitely imbalanced logistic regression. The Journal of Machine Learning
Research 8, 761–773.
Pukelsheim, F. (2006). Optimal design of experiments. SIAM.
Rahman, M. M. and Davis, D. (2013). Addressing the class imbalance problem in medical datasets.
International Journal of Machine Learning and Computing 3, 2, 224.
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and
anticausal learning. In ICML 2012: Proceedings of the 29th International Conference on Machine
Learning, Edinburgh, Scotland, June 26-July 1, 2012, 1255–1262.
Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007). Cost-sensitive boosting for classification
of imbalanced data. Pattern Recognition 40, 12, 3358–3378.
Ting, D. and Brochu, E. (2018). Optimal subsampling with influence functions. In Advances in
neural information processing systems, 3650–3659.
van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, London.
Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal
of Machine Learning Research 20, 132, 1–59.
Wang, H. (2020). Logistic regression for massive data with rare events. In H. D. III and A. Singh, eds.,
Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings
of Machine Learning Research, 9829–9836. PMLR.
Wang, H. and Ma, Y. (2021). Optimal subsampling for quantile regression in big data. Biometrika
108, 1, 99–112.
Wang, H., Zhu, R., and Ma, P. (2018). Optimal subsampling for large sample logistic regression.
Journal of the American Statistical Association 113, 522, 829–844.
Xiong, S. and Li, G. (2008). Some results on the convergence of conditional distributions. Statistics
& Probability Letters 78, 18, 3249–3253.
Yu, J., Wang, H., Ai, M., and Zhang, H. (2020). Optimal distributed subsampling for maximum
quasi-likelihood estimators with massive data. Journal of the American Statistical Association 0, 0,
1–12.