Probability and Non-Probability Samples LMU Munich 2022
Gerhard Tutz
Ludwig-Maximilians-Universität München
Akademiestraße 1, 80799 München
April 5, 2022
Abstract
Non-probability sampling, for example in the form of online panels, has
become a fast and cheap method to collect data. While reliable inference
tools are available for classical probability samples, non-probability sam-
ples can yield strongly biased estimates since the selection mechanism is
typically unknown. We propose a general method for improving statistical
inference when, in addition to a probability sample, data from other
sources, which have to be considered non-probability samples, are available.
The method uses specifically tailored regression residuals to enlarge
the original data set by including observations from other sources that can
be considered as stemming from the target population. Measures of accuracy
of estimates are obtained by adapted bootstrap techniques. It is
demonstrated that the method can improve estimates in a wide range of
scenarios. For illustrative purposes, the proposed method is applied to two
data sets.
1 Introduction
In the survey research literature there is an ongoing discussion on how to use
non-probability sample surveys to obtain better estimates. Statistical analysis of
non-probability survey samples, such as web-based surveys, faces many
challenges since the selection mechanism for non-probability samples is typically
unknown and treating non-probability samples as if they were a simple random
sample often leads to biased results (Baker et al., 2013; Kim et al., 2021). Proba-
bility sampling theory provides a strong theoretical basis for the computation of
the accuracy of estimates in probability samples. For non-probability sample sur-
veys no such framework is available. Typically, untested modeling assumptions
have to be made that reflect how the sample differs from the population of inter-
est (Cornesse et al., 2020). Forms of assumptions include quasi-randomization or
superpopulation modeling (Jean-Claude, 1991; Elliott and Valliant, 2017). One
alternative strategy is based on sample matching. It is assumed that a probability
sample is selected from the target population; the sample must contain a
set of auxiliary variables that is also available in the non-probability sample and
can be used to match the samples, see, for example, Bethlehem (2016). A review
on conceptual approaches has been given by Cornesse et al. (2020), see also Kim
et al. (2021), Wiśniowski et al. (2020), Elliott and Valliant (2017), Chen et al.
(2020).
In survey methodology the focus typically is on the estimation of the population
mean of the target population. As in missing data problems,
assumptions are made on the probability of being included in the non-probability sample,
with covariates providing the required auxiliary information that is used to find
weights in imputation estimators, see, for example, Kim et al. (2021). The
approach considered here is fundamentally different. The focus is not on the mean
of populations but on regression models. We investigate how the estimation of
regression coefficients in a target population can be improved by including data
from non-probability samples. As in many approaches in survey methodology
(see, for example, Vavreck and Rivers (2008); Lee and Valliant (2009); Brick
(2015)) we assume that a probability sample of responses and covariates is avail-
able. The same covariates are observed in the non-probability sample, which
is not representative of the target population due to the unknown sample se-
lection mechanism. Diagnostic tools for regression models are used to decide if
observations from the non-probability sample should be included in the analysis.
In Section 2, residuals in regression are first briefly sketched; then it is shown
how they can be used to improve estimates of regression coefficients by enlarging
the original data set. In Section 3 some simulation results are given. In Section
4 a bootstrap-based concept for obtaining reliable estimates of standard errors
is presented. Some applications are found in Section 5.
2 Improving Estimates
In the following it is assumed that two samples are available:
H = (hij ) the variance var(ri ) = σ 2 (1 − hii ). Scaling of residuals to the same
variance produces the form

    r̃i = ri / √(1 − hii ),

with var(r̃i ) = σ 2 . If, in addition, one divides by the root of the estimated
variance σ̂ 2 = (r T r)/(n − p − 1), where p + 1 is the length of xi , one obtains the
studentized residual

    ri∗ = r̃i / σ̂ = (yi − µ̂i ) / (σ̂ √(1 − hii )),
which behaves much like a Student’s t random variable except for the fact that
the numerator and denominator are not independent.
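These quantities are straightforward to compute from the hat matrix. A minimal sketch for an ordinary least-squares fit (in Python; the function and variable names are ours):

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized residuals for the linear model y = X beta + eps.

    X is the n x (p+1) design matrix including the intercept column.
    """
    n, p1 = X.shape                      # p1 = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                # hat matrix H = X (X'X)^{-1} X'
    h = np.diag(H)                       # leverages h_ii
    beta_hat = XtX_inv @ X.T @ y
    r = y - X @ beta_hat                 # raw residuals, var(r_i) = sigma^2 (1 - h_ii)
    sigma2_hat = (r @ r) / (n - p1)      # estimate of sigma^2 with n - p - 1 df
    return r / np.sqrt(sigma2_hat * (1.0 - h))
```

The returned values are the ri∗ of the display above, computed in one pass from the leverages.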
An alternative useful approach is based on case deletion. Let β̂ (i) denote the
least-squares estimate resulting from the data set in which observation (yi , xi ) is
omitted. The change in β that results if the i-th observation is omitted may be
measured by
∆i β̂ = β̂ − β̂ (i) .
For the linear model no refitting is needed since an explicit form is available given
by ∆i β̂ = (X T X)−1 xi ri /(1 − hii ), see, for example, Cook and Weisberg (1982).
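The closed form can be checked numerically against an actual refit; a small sketch (in Python; names are ours):

```python
import numpy as np

def delta_beta(X, y, i):
    """Change in the OLS estimate when observation i is deleted,
    via the closed form (X'X)^{-1} x_i r_i / (1 - h_ii)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    x_i = X[i]
    r_i = y[i] - x_i @ beta_hat          # raw residual of observation i
    h_ii = x_i @ XtX_inv @ x_i           # leverage of observation i
    return XtX_inv @ x_i * r_i / (1.0 - h_ii)
```

Comparing this with the difference between the full fit and a refit on the reduced data set confirms that no refitting is needed.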
(1) Fit the regression model yi = xTi β + εi by using the probability sample to
obtain β̂ 0 .
(2) Fit the regression model to data S0 ∪ {(yi , xi )}, where (yi , xi ) ∈ SN P , that is,
    one observation from the non-probability sample is added to the probability
    sample, to obtain β̂ (i) and the corresponding studentized residual rs,i∗ ; the
    first criterion requires |rs,i∗ | < ts for a threshold ts .

(3) The second criterion requires that the change in parameters satisfies

    ∆i = ‖β̂ 0 − β̂ (i) ‖ / ‖β̂ 0 ‖ < tc ,
where tc is a threshold for the change in parameters when observation
(yi , xi ) is included.
(4) All observations that fulfill both criteria are added to the probability sample
to obtain the extended sample, which is used to compute the final estimate.
The first criterion uses the studentized residual in the enlarged sample S0 ∪
{(yi , xi )}. If the observation (yi , xi ) is from the target population its studentized
residual should not be too large. The second criterion is based on case deletion
but case deletion in the enlarged sample S0 ∪ {(yi , xi )}. It compares parameter
estimates in the probability sample with estimates in the sample enlarged by one
observation. Thus it is case deletion regarding the enlarged sample but can also
be seen as case addition regarding the original probability sample. When adding
the i-th observation the change in parameters should not be too large if it is from
the target population.
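Putting the steps together, the screening procedure can be sketched as follows; the function mirrors steps (1) to (4), while the concrete details (taking the absolute value of the residual, the loop structure) are our assumptions:

```python
import numpy as np

def extend_sample(X0, y0, X_np, y_np, t_s, t_c):
    """Screen non-probability observations one at a time (steps (1)-(4)).

    Each candidate is added alone to the probability sample; it is kept for
    the extended sample if its studentized residual in the enlarged fit is
    below t_s and the relative parameter change is below t_c."""
    beta0 = np.linalg.lstsq(X0, y0, rcond=None)[0]        # step (1)
    keep = []
    for i in range(len(y_np)):
        X1 = np.vstack([X0, X_np[i]])
        y1 = np.append(y0, y_np[i])
        beta_i = np.linalg.lstsq(X1, y1, rcond=None)[0]   # step (2): enlarged fit
        n, p1 = X1.shape
        h_last = X1[-1] @ np.linalg.inv(X1.T @ X1) @ X1[-1]
        r = y1 - X1 @ beta_i
        sigma2 = (r @ r) / (n - p1)
        r_star = r[-1] / np.sqrt(sigma2 * (1.0 - h_last)) # studentized residual
        delta = np.linalg.norm(beta0 - beta_i) / np.linalg.norm(beta0)
        if abs(r_star) < t_s and delta < t_c:             # both criteria
            keep.append(i)
    X_ext = np.vstack([X0, X_np[keep]])                   # step (4)
    y_ext = np.append(y0, y_np[keep])
    return X_ext, y_ext, keep
```

The final estimate is then the least-squares fit on the returned extended sample.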
Choice of Thresholds
The procedure depends on the choice of the thresholds ts and tc , which we link to
quantiles. Given a nominal error level αst , we choose for the studentized residuals
ts = q1−αst , where q1−αst denotes the (1 − αst )-quantile of the standard normal
distribution; in the applications we use αst = 0.05.
The choice of the threshold for the change in parameters is based on the
distribution of changes in parameters that are to be expected in a sample from the
target population. In the probability sample, which is from the target population,
there is natural variation in estimates if one observation is omitted. We explicitly
investigate this distribution in the following way:
(a) For all observations from the probability sample fit the model by deleting
    one observation at a time. Fitting the model by using observations S0
    without {(yi , xi )}, that is, without the i-th observation, yields β̂ −i and the
    relative change

    chi = ‖β̂ 0 − β̂ −i ‖ / ‖β̂ 0 ‖.
The method to enlarge the data set uses two tuning parameters, αst and αch .
If αst and αch are small, many observations from the non-probability sample are
included in the extended data set; if they are large, the number of included observations
is small. Thus the α-values are an indicator of the size of the extended
sample: with increasing α-values the extended sample gets smaller and in the
extreme case consists of the original probability sample only.
It should be noted that the α-values are not significance levels in the tradi-
tional sense. They are tuning parameters, but it seems quite natural to describe
them in distributional terms. As tuning parameters they can also be chosen in
a data-driven way, for example, by cross-validation. For simplicity one can also
use αst = αch .
Cross-Validation
The tuning parameters αst and αch can be chosen as fixed values. An alternative
is to choose them in a data-driven way by cross-validation. Then the probability
sample is split randomly into k distinct data sets S01 , . . . , S0k of approximately
equal size. One considers S0 \ S0j as the learning data set, which is used to obtain an
enlarged sample, and S0j as the validation sample, for which, based on the extended
sample, predicted values are computed. The discrepancy between the predicted
values and the observed values in S0j shows how well the procedure works in
terms of prediction. Computing the discrepancy for all subsamples S0j and
averaging yields a measure of the performance of fixed tuning parameters αst
and αch . Computing the discrepancies on a grid of α-values allows one to choose
the values that show the best performance. As discrepancy measure one can use
the squared error. One can search the whole grid or use a less demanding
cross-validation that uses a reduced grid of values with αst = αch .
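This k-fold scheme can be sketched as follows; the enlargement procedure itself is passed in as a callable, and the squared-error discrepancy and fold construction are our assumptions:

```python
import numpy as np

def cv_score(X0, y0, X_np, y_np, enlarge, alpha, k=5, seed=0):
    """k-fold cross-validation score for one value alpha (= alpha_st = alpha_ch).

    `enlarge(X_learn, y_learn, X_np, y_np, alpha)` must return the extended
    design matrix and response; it stands for the screening procedure."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y0))
    folds = np.array_split(idx, k)
    err = 0.0
    for val in folds:
        learn = np.setdiff1d(idx, val)                    # S_0 without S_0j
        Xe, ye = enlarge(X0[learn], y0[learn], X_np, y_np, alpha)
        beta = np.linalg.lstsq(Xe, ye, rcond=None)[0]
        err += np.sum((y0[val] - X0[val] @ beta) ** 2)    # squared-error discrepancy
    return err / len(y0)

# choose alpha on a grid by minimizing the averaged squared error, e.g.
# best_alpha = min(grid, key=lambda a: cv_score(X0, y0, X_np, y_np, enlarge, a))
```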
on explanatory variables only. This seems sensible if one is interested in the
impact of covariates only. If, however, one aims at prediction, the whole vector of
coefficients makes more sense. The choice is of minor importance since we found
that the improvement obtained by the enlargement of the sample is very similar
for both types of parameter vectors.
Further measures of performance refer to the identification of observations in
the non-probability sample. As in classification problems one can consider
3 Some Simulations
3.1 One Predictor
For illustration we consider briefly the case of just one predictor with fixed param-
eters. In a small simulation study we compare the estimators based on the prob-
ability sample and the extended sample. The simulation uses n = 40, n1 = 200,
n2 = 200. The distribution of covariates and the parameters are specified in the
following way.
Figure 1 shows the results for αst = αch = 0.05 (setting (a)). The upper left
picture shows the MSEs for the probability sample estimator (PSE, light) and the
estimator based on the extended sample (ExtE, dark). It is seen that the MSE of
the ExtE (dark blue) is distinctly smaller than the MSE of the PSE (light blue).
This is also seen from the upper right picture, which shows the relative mean
squared errors. The inclusion of observations from the non-probability sample
yields strongly improved estimates. The second row shows the proportions of
correctly and incorrectly identified observations in the 50 simulated data sets
(left) and the observations of both variables for the first simulated data set (right).
It is seen that the number of observations from the non-probability sample that
are correctly identified as coming from the target population is always large; the
proportion incorrectly assumed to come from the target population is between
0.20 and 0.35. But, as the distribution of the response and the predictor in the
right picture shows, incorrectly identified observations do not necessarily affect
the accuracy of the estimator; it is only essential to exclude all observations that
might introduce bias. The last row shows the box plots of the estimates for the two
parameters β0 and β1 with the horizontal line indicating the true value. It is seen
that in particular the variability of estimates is smaller for the extended sample,
denoted by regression extended.
[Figure 1: setting (a), n = 40, n1 = 200, n2 = 200. Panels: MSE and relative
norm of the two estimators; rates of identified observations across simulations;
observed data; estimates for Variable 0 and Variable 1.]
Figure 2 shows the corresponding pictures for setting (b). Although the number
of incorrectly identified observations is much larger than in setting (a), the
improvement in terms of MSE is very strong. The reason is that the observations
that are incorrectly included do not distort the estimate, since the slope in the
polluted sample differs from the slope in the target population, but not as strongly
as in setting (a), where the signs of the slopes differ. For further illustrations see
the Appendix.
[Figure 2: setting (b), n = 40, n1 = 200, n2 = 200. Panels: MSE and relative
norm of the two estimators; rates of identified observations across simulations;
observed data; estimates for Variable 0 and Variable 1.]
the regression parameters in the polluted sample are determined as deviations
from the parameters in the probability sample, βi = β0i + σpar · N (0, 1), with σpar
determining the strength of the shift. We consider the following settings.
Figure 3 shows the mean squared errors of setting 1 with four explanatory
variables for varying numbers of observations n1 , n2 , when the number of observations
in the probability sample increases from 40 to 200. For simplicity we
used αst = αch = 0.05; choice by cross-validation yielded very similar α-values
and typically comparable performance. It is seen that the MSEs for both
the probability sample estimator (PSE) and the estimator based on the extended
sample (ExtE) decrease with increasing sample sizes. The ExtE performs consistently
better than the PSE. Only in the case where there is not much to gain
from using the observations in the non-probability sample (n1 = 200) and there
are many polluted observations (n2 = 800) does the dominance become weaker, in
particular for large probability sample sizes.
[Figure 3: MSE of PSE and ExtE against the number of observations in the
probability sample (40 to 200) for the panels n1 = 200, n2 = 400; n1 = 400,
n2 = 400; n1 = 200, n2 = 800; n1 = 400, n2 = 800.]
The left picture in Figure 4 shows the comparison of estimates for varying
numbers of observations from the target population in the non-probability sample.
It is seen that the increased availability of observations from the target population
decreases the error in the extended sample. The right picture shows the effects
of varying numbers of observations in the polluted data sets. It is seen that
increasing the number of polluted observations hardly affects the performance of
the estimates in the extended sample.
[Figure 4: MSE against the number of observations from the target population
in the non-probability sample (left, n = 40, n2 = 400) and against the number of
observations in the non-probability sample (right, n = 80, n1 = 200).]
Figure 5 shows the comparisons for varying numbers of observations in the
probability sample for setting 2, with stronger variation in the polluting sample. The
improvement of estimates is similar to that seen in setting 1.
[Figure 5: MSE of PSE and ExtE against the number of observations in the
probability sample for setting 2, panels with n1 = 200, n2 = 400.]
4 Standard Errors
Standard errors for parameter estimates of linear models are easily obtained by
using cov(β̂) = σ 2 (X T X)−1 and replacing σ 2 by the estimator σ̂ 2 = Σi (yi −
xTi β̂)2 /(n − (p + 1)). When using the extended sample it is tempting to use
the same estimator with the extended design matrix. However, the estimator
is severely biased since it ignores the process that augments the data; it simply
assumes that the extended data set is a probability sample. The larger number
of observations creates the illusion that standard errors are much smaller than
they actually are.
A way to obtain reliable estimates is bootstrapping (Efron and Tibshirani,
1994; Davison and Hinkley, 1997). For a given data set consisting of the proba-
bility sample and the non-probability sample, the whole estimation procedure is
carried out repeatedly for data that are obtained by drawing observations from
the data set with replacement. The variation of the estimates is used to compute
realistic standard errors of parameters.
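A sketch of such a bootstrap follows; resampling the probability and the non-probability sample separately is one plausible reading of the procedure and is our assumption, as is the interface of the `estimate` callable:

```python
import numpy as np

def bootstrap_se(X0, y0, X_np, y_np, estimate, n_boot=100, seed=0):
    """Bootstrap standard errors for the extended-sample estimator.

    `estimate(X0, y0, X_np, y_np)` runs the whole screening-and-fitting
    procedure and returns the coefficient vector.  Both samples are
    resampled with replacement in each repetition."""
    rng = np.random.default_rng(seed)
    betas = []
    for _ in range(n_boot):
        i0 = rng.integers(0, len(y0), len(y0))       # resample probability sample
        i1 = rng.integers(0, len(y_np), len(y_np))   # resample non-prob sample
        betas.append(estimate(X0[i0], y0[i0], X_np[i1], y_np[i1]))
    return np.std(np.array(betas), axis=0, ddof=1)   # SE = sd over replications
```

Because the screening is rerun on every bootstrap replication, the variability induced by the data augmentation is reflected in the resulting standard errors.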
An approximation to the true standard errors is obtained by repeating the
estimation procedure for nrep = 100 drawings with fixed parameters. To obtain
bootstrap standard errors, we used nboot = 100 bootstrap repetitions. We show
the results for the following selected simulation settings.
Table 1 shows the resulting estimates. Prob sample denotes the standard
error in the probability sample; Non-prob naive are the standard errors obtained
when fitting in the extended sample but ignoring
that the sample has been extended. They are only shown to demonstrate that
these estimates should not be used. Non-prob actual denotes the approximated
actual standard errors and Non-prob boot the bootstrap estimates. It is seen that
the nominal standard errors Non-prob naive strongly underestimate the actual
standard error. The bootstrap errors are comparable to the actual errors but tend
to be slightly larger, which makes them somewhat conservative when used in
testing. Standard errors in the extended sample are smaller than the standard
errors in the probability sample, which typically is also seen in the bootstrap
estimates, although these are slightly larger than the actual errors. However,
though bootstrap standard errors are not much smaller than standard errors in
the probability sample, estimates in the extended sample are more precise, as has
been demonstrated in the simulations in Section 3.
Table 1: Standard errors (probability sample, naive, approximated actual, and
bootstrap estimates) for settings B1–B3.

                              β0      β1      β2      β3      β4
Setting B1  Prob sample      0.252   0.133   0.133   0.162   0.163
            Non-prob naive   0.090   0.053   0.052   0.053   0.052
            Non-prob actual  0.191   0.102   0.111   0.124   0.115
            Non-prob boot    0.224   0.130   0.143   0.133   0.134
Setting B2  Prob sample      0.552   0.301   0.275   0.336   0.289
            Non-prob naive   0.175   0.100   0.098   0.099   0.099
            Non-prob actual  0.368   0.199   0.202   0.213   0.192
            Non-prob boot    0.436   0.267   0.275   0.284   0.258
Setting B3  Prob sample      0.269   0.164   0.152   0.158   0.166
            Non-prob naive   0.085   0.056   0.055   0.055   0.056
            Non-prob actual  0.205   0.107   0.106   0.109   0.115
            Non-prob boot    0.257   0.154   0.147   0.152   0.155
the data sets.
In the second part of Table 2 the same combinations of probability and non-probability
samples are considered, but now we do not use the full probability
samples to estimate parameters. Since these are real data sets it is unknown
which estimates are better, the ones for the probability sample or the ones based
on the extended samples. Estimates in the probability sample could suffer from
outliers. Therefore, for a fixed probability sample we consider which observations
would be included if the same sample is used as non-probability sample. Then
we use only the observations that are considered as coming from the probability
sample to obtain a reduced probability sample. For the district “Inner City”, from
the original 47 observations only 43 were included in the reduced data set. It is
seen that this can distinctly change the estimates. For example, the estimate for
area, which is an important value since it indicates how much one has to pay per
square meter after accounting for the number of rooms, is 11.21 in the original
sample but 10.46 in the reduced sample, which makes a difference for people
renting flats.
The estimate in the reduced sample can be seen as a robustified estimator
since outliers are excluded. As estimator of the standard error, in this case the
bootstrap estimator is given, since the naive estimator is certainly biased. Table
3 shows how the samples are reduced if outliers are omitted. ’Kept’ refers to the
number of observations that are kept in the sample and ’deleted’ to the number
of observations that are left out by the robustified estimator.
Table 2: Munich rent data for various districts and data sets enlarged by including
neighboring districts (standard errors obtained by bootstrap).

                            Observations          Parameters         Std err
                           available  used      area     rooms     area    rooms
Inner city                     47      47      11.21    -92.56    1.866   58.072
  enlarged by Bogenhausen     206     198      11.22    -64.87    1.569   43.937
  enlarged by Schwabing       212     203      10.21    -53.22    1.785   49.880
Schwabing                     165     165      11.16    -42.36    1.119   31.581
  enlarged by Maxvorstadt     333     324      10.40    -26.64    1.203   25.907
  enlarged by Bogenhausen     324     314      11.27    -46.03    1.271   24.801
Inner city reduced             47      43      10.46    -97.99    2.187   66.978
  enlarged by Bogenhausen     206     191      10.75    -69.64    1.413   42.952
  enlarged by Schwabing       212     194       9.87    -59.18    1.524   43.958
Schwabing reduced             165     151      10.11    -46.08    2.050   63.679
  enlarged by Maxvorstadt     333     319       9.62    -30.84    1.083   22.314
  enlarged by Bogenhausen     324     290      10.35    -44.74    1.111   22.838
Table 3: Munich rent data; numbers of observations kept and deleted when the
probability sample is also used as the non-probability sample.
Table 4: Fears data for various federal states and data sets enlarged by including
neighboring states (standard errors obtained by bootstrap).

                                    Observations           Parameters               Std err
                                   available  used    Age    Gender  Abitur    Age    Gender  Abitur
Bavaria                               317     317    0.010   0.566   -0.294   0.005   0.193   0.209
  enlarged by BW                      522     507    0.012   0.619   -0.273   0.004   0.166   0.206
  enlarged by BW, Hessen              837     795    0.014   0.600   -0.226   0.004   0.165   0.197
Berlin                                 62      62    0.007   0.816    0.148   0.011   0.452   0.450
  enlarged by Brandenburg             208     189    0.011   0.572    0.567   0.010   0.429   0.413
  enlarged by Brandenburg, Saxony     380     341    0.014   0.703    0.246   0.010   0.435   0.404
Table 5: Fears data for various federal states and data sets enlarged by including
neighboring states; robustified estimates obtained by initially reducing the probability
sample.

                             Observations        Parameters        Std err
                            available  used     Age    Gender     Age    Gender
Bavaria, all obs               317     317     0.012   0.586     0.005   0.193
Bavaria, reduced               317     290     0.020   0.874     0.004   0.171
  enlarged by BW               495     464     0.023   0.850     0.003   0.168
  enlarged by BW, Hessen       632     583     0.024   0.878     0.003   0.166
6 Concluding Remarks
It has been demonstrated that the inclusion of selected observations from a differ-
ent sample can strongly improve parameter estimation in regression modelling.
The model that has been considered was the classical linear model with least
squares estimates. The basic methodology, namely the inclusion of observations
based on an investigation of residuals, can also be used with alternative estimators.
In regression with many explanatory variables, shrinkage estimators
and other regularized estimators (Tibshirani, 1996; Hastie et al., 2009) are nowadays in
common use. It is straightforward to use residuals for these estimators to select
observations. The approach could also be used in the fitting of extended
models such as additive models, generalized additive models (Hastie and Tibshirani,
1990; Wood, 2017) and generalized linear models (McCullagh and Nelder, 1983;
Fahrmeir and Tutz, 2001) by using appropriate residuals.
The method to enlarge the probability sample used here is an in-or-out procedure.
In future research one might alternatively compute weights based on
residuals and the change in parameters, and put weights on observations in the
extended sample. Observations in the probability sample might get weight 1,
while observations that are added from the non-probability sample get smaller
weights depending on the values of the residual measures.
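As a purely illustrative sketch of this weighting idea, a Huber-type downweighting of non-probability observations by their studentized residuals could look as follows (the weight function and all names are our assumptions, not part of the proposed method):

```python
import numpy as np

def weighted_fit(X0, y0, X_np, y_np, r_star, c=1.645):
    """Soft version of the in-or-out rule: probability-sample observations
    get weight 1, non-probability observations get a Huber-type weight
    that decreases with their studentized residual r_star."""
    w_np = np.minimum(1.0, c / np.maximum(np.abs(r_star), 1e-12))
    X = np.vstack([X0, X_np])
    y = np.append(y0, y_np)
    w = np.append(np.ones(len(y0)), w_np)
    # weighted least squares via row scaling with sqrt(w)
    sw = np.sqrt(w)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```

With all residuals equal to zero the weights reduce to 1 and the fit coincides with ordinary least squares on the stacked data, so the sketch contains the in-or-out rule as a limiting case.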
References
Baker, R., J. M. Brick, N. A. Bates, M. Battaglia, M. P. Couper, J. A. Dever, K. J.
Gile, and R. Tourangeau (2013). Summary report of the AAPOR task force on
non-probability sampling. Journal of Survey Statistics and Methodology 1 (2),
90–143.
Bethlehem, J. (2016). Solving the nonresponse problem with sample matching?
Social Science Computer Review 34 (1), 59–77.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York:
Springer–Verlag.
Brick, J. M. (2015). Compositional model inference. JSM Proceedings, Survey
Research Methods Section, 299–307.
Chen, Y., P. Li, and C. Wu (2020). Doubly robust inference with nonprobability
survey samples. Journal of the American Statistical Association 115 (532),
2011–2021.
Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression.
London: Chapman & Hall.
Cornesse, C., A. G. Blom, D. Dutwin, J. A. Krosnick, E. D. De Leeuw, S. Legleye,
J. Pasek, D. Pennay, B. Phillips, J. W. Sakshaug, et al. (2020). A review of
conceptual approaches and empirical evidence on probability and nonprobabil-
ity sample survey research. Journal of Survey Statistics and Methodology 8 (1),
4–36.
Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods and Their Appli-
cation. Cambridge: Cambridge University Press.
Efron, B. and R. J. Tibshirani (1994). An Introduction to the Bootstrap, Volume
57. CRC Press.
Elliott, M. R. and R. Valliant (2017). Inference for nonprobability samples.
Statistical Science 32 (2), 249–264.
Fahrmeir, L., T. Kneib, S. Lang, and B. Marx (2013). Regression models.
Springer.
Lee, S. and R. Valliant (2009). Estimation for volunteer panel web surveys using
propensity score adjustment and calibration adjustment. Sociological Methods
& Research 37 (3), 319–343.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society B 58, 267–288.
Appendix
Illustration Univariate Predictor
Settings (a) and (b) were considered in the text; now we consider one additional
setting:
[Figure: additional univariate setting, n = 40. Panels: MSE and relative norm of
the two estimators; rates of identified observations across simulations; observed
data; estimates for Variable 0 and Variable 1.]
Further Simulations Multivariate Predictor
We consider again Setting 1 with four predictors but now the noise in the prob-
ability sample is larger than in the polluted sample, var(ε) = 4 in probability
sample, var(ε) = 1 in polluted sample. In this case the choice αst = αch = 0.05
turns out to perform less convincing. Therefore we used cross-validation to obtain
the pictures in . The average values of αs (αst = αch ) chosen by cross validation
for the five settings was 0.155, 0.141, 0.203, 0.190, 0.193. Choice by cross valida-
tion on the full grid of α values from the grid 0.3, 0.2, 0.1, 0.05 showed no further
improvement.
[Figure 8: MSE of PSE and ExtE against the number of observations in the
probability sample; panels n1 = 200, n2 = 200 and n1 = 200, n2 = 400.]
Figure 8: Simulation setting 1 with noise in the probability sample larger than
in the polluted sample, cross-validated.