
Probability and Non-Probability Samples:
Improving Regression Modeling by Using Data from Different Sources

Gerhard Tutz
Ludwig-Maximilians-Universität München
Akademiestraße 1, 80799 München

arXiv:2204.01110v1 [stat.ME] 3 Apr 2022

April 5, 2022

Abstract

Non-probability sampling, for example in the form of online panels, has become a fast and cheap method to collect data. While reliable inference tools are available for classical probability samples, non-probability samples can yield strongly biased estimates since the selection mechanism is typically unknown. We propose a general method to improve statistical inference when, in addition to a probability sample, data from other sources, which have to be considered non-probability samples, are available. The method uses specifically tailored regression residuals to enlarge the original data set by including observations from other sources that can be considered as stemming from the target population. Measures of accuracy of the estimates are obtained by adapted bootstrap techniques. It is demonstrated that the method can improve estimates in a wide range of scenarios. For illustrative purposes, the proposed method is applied to two data sets.

Keywords: Probability samples; non-probability samples; regression; residual analysis; inference.

1 Introduction
In the survey research literature there is an ongoing discussion on how to use non-probability sample surveys to obtain better estimates. Statistical analysis of non-probability survey samples such as, for example, web based surveys faces many challenges since the selection mechanism for non-probability samples is typically unknown, and treating non-probability samples as if they were a simple random sample often leads to biased results (Baker et al., 2013; Kim et al., 2021). Probability sampling theory provides a strong theoretical basis for the computation of the accuracy of estimates in probability samples. For non-probability sample surveys no such framework is available. Typically, untested modeling assumptions have to be made that reflect how the sample differs from the population of interest (Cornesse et al., 2020). Forms of assumptions include quasi-randomization or superpopulation modeling (Jean-Claude, 1991; Elliott and Valliant, 2017). One alternative strategy is based on sample matching: it is assumed that a probability sample is selected from the target population, and this sample must contain a set of auxiliary variables that is also available in the non-probability sample and can be used to match samples; see, for example, Bethlehem (2016). A review of conceptual approaches has been given by Cornesse et al. (2020); see also Kim et al. (2021), Wiśniowski et al. (2020), Elliott and Valliant (2017), Chen et al. (2020).
In survey methodology the focus typically is on the estimation of the population mean of the target population. As in missing data problems, assumptions on the probability of being included in the non-probability sample are made, with covariates providing the required auxiliary information that is used to find weights in imputation estimators; see, for example, Kim et al. (2021). The approach considered here is fundamentally different. The focus is not on the mean of populations but on regression models. We investigate how the estimation of regression coefficients in a target population can be improved by including data from non-probability samples. As in many approaches in survey methodology (see, for example, Vavreck and Rivers (2008); Lee and Valliant (2009); Brick (2015)) we assume that a probability sample of responses and covariates is available. The same covariates are observed in the non-probability sample, which is not representative of the target population due to the unknown sample selection mechanism. Diagnostic tools for regression models are used to decide if observations from the non-probability sample should be included in the analysis.
In Section 2 residuals in regression are first briefly sketched; then it is shown how they can be used to improve estimates of regression coefficients by enlarging the original data set. In Section 3 some simulation results are given. In Section 4 a bootstrap based concept for obtaining reliable estimates of standard errors is given. Some applications are found in Section 5.

2 Improving Estimates
In the following it is assumed that two samples are available:

S0 = {(yi, xi), i = 1, . . . , n} is a random sample from the target population, called the probability sample,

S^NP = S1 ∪ S2 is a non-probability sample with S1 = {(yi, xi), i = 1, . . . , n1} from the target population, and S2 = {(yi, xi), i = 1, . . . , n2} being a polluted sample.

In the non-probability sample it is not known which observations are from the target population and which are from the polluted sample. While the observations S1 from the non-probability sample can be used to improve estimates of regression coefficients, inclusion of observations from sample S2 might yield distorted estimates since the distribution of responses and covariates in the population from which sample S2 is drawn might differ strongly from the distribution in the target population. The primary aim is to identify which observations from S^NP can be considered target population observations. It is basically a classification problem, but without the learning sample that is typically available in supervised classification problems. The role of the learning sample is taken by the probability sample, but in a different way than in traditional classification problems as described, for example, in Bishop (2006).

2.1 Regression Residuals


We first briefly sketch diagnostic tools in regression that are useful to distinguish between target and non-target observations. More detailed descriptions of diagnostic tools are found in Cook and Weisberg (1982); Fahrmeir and Tutz (2001); Fahrmeir et al. (2013).
For observations (yi, xi), i = 1, . . . , n, the linear model has the form yi = xiᵀβ + εi, where xiᵀ = (1, xi1, . . . , xip) is the vector of explanatory variables, βᵀ = (β0, β1, . . . , βp) is the vector of coefficients, and the εi are independent noise variables following a normal distribution N(0, σ²). In closed matrix form one has y = Xβ + ε, where yᵀ = (y1, . . . , yn) is the vector of responses, X is the design matrix composed from the vectors of explanatory variables, and εᵀ = (ε1, . . . , εn) is the vector of errors. Fitting by least squares yields the familiar estimate β̂.
Residuals provide strong diagnostic tools. The simple residual ri = yi − xiᵀβ̂ measures the discrepancy between the actual observation and the fitted value ŷi = xiᵀβ̂ and is a preliminary indicator for ill-fitting observations. However, it does not take the variability of the estimate into account. It is easily derived that the vector of residuals r = (r1, . . . , rn)ᵀ = y − ŷ, where ŷᵀ = (x1ᵀβ̂, . . . , xnᵀβ̂), has covariance matrix cov(r) = σ²(I − H), where H is the projection matrix H = X(XᵀX)⁻¹Xᵀ. Therefore, one obtains with the diagonal elements of H = (hij) the variance var(ri) = σ²(1 − hii). Scaling the residuals to the same variance produces the form

    r̃i = ri / √(1 − hii),

with var(r̃i) = σ². If, in addition, one divides by the root of the estimated variance σ̂² = (rᵀr)/(n − p − 1), where p + 1 is the length of xi, one obtains the studentized residual

    ri* = r̃i / σ̂ = (yi − ŷi) / (σ̂ √(1 − hii)),

which behaves much like a Student's t random variable except for the fact that the numerator and denominator are not independent.
An alternative useful approach is based on case deletion. Let β̂(i) denote the least-squares estimate resulting from the data set in which observation (yi, xi) is omitted. The change in β that results if the i-th observation is omitted may be measured by

    ∆iβ̂ = β̂ − β̂(i).

For the linear model no refitting is needed since an explicit form is available, given by ∆iβ̂ = (XᵀX)⁻¹ xi ri /(1 − hii); see, for example, Cook and Weisberg (1982).
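The quantities above can be computed directly from the design matrix. The following is an illustrative sketch (not the author's code; the function and variable names are our own) that computes the least-squares fit, the studentized residuals, and the case-deletion changes ∆iβ̂ via the closed form:

```python
import numpy as np

def regression_diagnostics(X, y):
    """Least-squares fit, studentized residuals, and case-deletion changes
    for the linear model y = X beta + eps (X contains an intercept column)."""
    n, p1 = X.shape                           # p1 = p + 1 coefficients
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y              # least-squares estimate
    h = np.diag(X @ XtX_inv @ X.T)            # leverages h_ii of the hat matrix
    r = y - X @ beta_hat                      # simple residuals
    sigma2_hat = (r @ r) / (n - p1)           # estimated error variance
    r_stud = r / np.sqrt(sigma2_hat * (1.0 - h))    # studentized residuals
    # Delta_i beta = (X'X)^{-1} x_i r_i / (1 - h_ii); column i of the matrix below
    delta_beta = (XtX_inv @ X.T) * (r / (1.0 - h))
    return beta_hat, r_stud, delta_beta
```

Because the closed form is used, the n case-deletion changes are obtained without refitting the model n times.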

2.2 Improved Estimates by Enlarging the Data Set

The basic concept is to use diagnostic tools for regression models to check whether observations from the non-probability sample should be considered as data from the target population or not. The main source of information is the probability sample, which is known to be from the target population. For each single observation from the non-probability sample it is investigated whether it is an "outlier".
The basic algorithm is the following:

(1) Fit the regression model yi = xiᵀβ + εi using the probability sample to obtain β̂0.

(2) Fit the regression model to the data S0 ∪ {(yi, xi)}, where (yi, xi) ∈ S^NP, that is, one observation from the non-probability sample is added to the probability sample, to obtain β̂(i) and the corresponding studentized residual rs,i.

(3) Observation (yi, xi) is considered to be from the target population if two criteria are fulfilled:

    |rs,i| ≤ ts,

where ts is a threshold for the studentized residuals, and

    ∆i = ‖β̂0 − β̂(i)‖ / ‖β̂0‖ < tc,

where tc is a threshold for the change in parameters when observation (yi, xi) is included.

(4) All observations that fulfill both criteria are added to the probability sample to obtain the extended sample, which is used to compute the final estimate.
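A minimal sketch of steps (1)-(4), assuming the thresholds ts and tc are simply given (the function names are our own; choosing tc from the leave-one-out distribution is described separately in the text):

```python
import numpy as np

def extend_sample(X0, y0, Xnp, ynp, t_s, t_c):
    """Sketch of the enlargement algorithm: each non-probability observation
    is added on its own to the probability sample (X0, y0) and kept only if
    its studentized residual and the relative change in coefficients stay
    below the thresholds t_s and t_c."""
    def ols(X, y):
        return np.linalg.lstsq(X, y, rcond=None)[0]

    beta0 = ols(X0, y0)                        # step (1): probability-sample fit
    keep = []
    for i in range(len(ynp)):
        Xa = np.vstack([X0, Xnp[i]])           # step (2): S0 plus one observation
        ya = np.append(y0, ynp[i])
        beta_i = ols(Xa, ya)
        n, p1 = Xa.shape
        h = np.diag(Xa @ np.linalg.inv(Xa.T @ Xa) @ Xa.T)
        r = ya - Xa @ beta_i
        sigma2 = (r @ r) / (n - p1)            # estimated error variance
        r_stud = r[-1] / np.sqrt(sigma2 * (1.0 - h[-1]))
        rel_change = np.linalg.norm(beta0 - beta_i) / np.linalg.norm(beta0)
        if abs(r_stud) <= t_s and rel_change < t_c:   # step (3): both criteria
            keep.append(i)
    Xext = np.vstack([X0, Xnp[keep]])          # step (4): extended sample
    yext = np.concatenate([y0, ynp[keep]])
    return ols(Xext, yext), keep
```

Each candidate observation is judged against the augmented fit S0 ∪ {(yi, xi)}, so a strongly deviating observation inflates its own studentized residual and parameter change and is excluded.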

The first criterion uses the studentized residual in the enlarged sample S0 ∪
{(yi , xi )}. If the observation (yi , xi ) is from the target population its studentized
residual should not be too large. The second criterion is based on case deletion
but case deletion in the enlarged sample S0 ∪ {(yi , xi )}. It compares parameter
estimates in the probability sample with estimates in the sample enlarged by one
observation. Thus it is case deletion regarding the enlarged sample but can also
be seen as case addition regarding the original probability sample. When adding
the i-th observation the change in parameters should not be too large if it is from
the target population.

Choice of Thresholds
The procedure depends on the choice of the thresholds ts and tc, which we link to quantiles. Given a nominal error level αst, we choose for the studentized residuals ts = q1−αst, where q1−αst denotes the (1 − αst)-quantile of the standard normal distribution; in the applications we use αst = 0.05.
The choice of the threshold for the change in parameters is based on the distribution of the changes in parameters that are to be expected in a sample from the target population. In the probability sample, which is from the target population, there is natural variation in the estimates if one observation is omitted. We explicitly investigate this distribution in the following way:

(a) For all observations from the probability sample fit the model by deleting one observation at a time. Fitting the model using the observations S0 without {(yi, xi)}, that is, without the i-th observation, yields β̂−i.

(b) Consider the distribution of the changes in parameters if one observation is omitted, with changes given by

    chi = ‖β̂0 − β̂−i‖ / ‖β̂0‖.

(c) Choose tc as the (1 − αch)-quantile of the empirical distribution of the values chi.

The distribution of the chi shows which changes are to be expected if observations are from the target population. If the relative change when including an observation from the non-probability sample is below the threshold tc, it is considered as coming from the target population.
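The leave-one-out distribution in steps (a)-(c) can be sketched as follows (illustrative code, not the author's implementation; for simplicity each deletion is refitted directly rather than via the closed form):

```python
import numpy as np

def choose_tc(X0, y0, alpha_ch=0.05):
    """Threshold t_c as the (1 - alpha_ch) empirical quantile of the relative
    coefficient changes ch_i observed under single-case deletion in the
    probability sample (steps (a)-(c))."""
    beta0 = np.linalg.lstsq(X0, y0, rcond=None)[0]
    changes = []
    for i in range(len(y0)):
        mask = np.arange(len(y0)) != i                       # delete observation i
        beta_mi = np.linalg.lstsq(X0[mask], y0[mask], rcond=None)[0]
        changes.append(np.linalg.norm(beta0 - beta_mi) / np.linalg.norm(beta0))
    return np.quantile(changes, 1.0 - alpha_ch)
```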

The method to enlarge the data set uses two tuning parameters, αst and αch. If αst and αch are small, many observations from the non-probability sample are included in the extended data set; if they are large, the number of included observations is small. Thus, the α-values are an indicator of the size of the extended sample: with increasing α-values the extended sample gets smaller and in the extreme case consists only of the original probability sample.
It should be noted that the α-values are not significance levels in the traditional sense. They are tuning parameters, but it seems quite natural to describe them in distributional terms. As tuning parameters they can also be chosen in a data-driven way, for example, by cross-validation. For simplicity one can also use αst = αch.

Cross-Validation
The tuning parameters αst and αch can be chosen as fixed values. An alternative is to choose them in a data-driven way by cross-validation. Then the probability sample is split randomly into k distinct data sets S01, . . . , S0k of approximately equal size. One considers S0 \ S0j as the learning data set, which is used to obtain an enlarged sample, and S0j as the validation sample, for which, based on the extended sample, predicted values are computed. The discrepancy between the predicted values and the observed values in S0j shows how well the procedure works in terms of prediction. Computing the discrepancy for all sub-samples S0j and averaging yields a measure of the performance of fixed tuning parameters αst and αch. Computing the discrepancies on a grid of α values allows one to choose the values that show the best performance. As discrepancy measure one can use the squared error. One can compute on the whole grid or use a less demanding cross-validation method that uses a reduced grid of values with αst = αch.
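Such a grid search over a common α = αst = αch could be sketched as follows; `enlarge` is a hypothetical interface standing in for the whole enlargement-and-fit procedure (only its signature matters here):

```python
import numpy as np

def cv_choose_alpha(X0, y0, enlarge, alphas, k=5, seed=0):
    """k-fold cross-validation on the probability sample for a common
    alpha = alpha_st = alpha_ch.  `enlarge(Xtr, ytr, alpha)` is assumed to
    return the coefficients fitted on the extended sample built from the
    training fold (hypothetical interface)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y0)), k)
    scores = []
    for a in alphas:
        err = 0.0
        for fold in folds:
            mask = np.ones(len(y0), dtype=bool)
            mask[fold] = False
            beta = enlarge(X0[mask], y0[mask], a)   # fit using S0 \ S0j
            err += np.sum((y0[fold] - X0[fold] @ beta) ** 2)
        scores.append(err / len(y0))                # mean squared prediction error
    return alphas[int(np.argmin(scores))]
```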

2.3 Accuracy Measures

For the parameter estimate β̂ a measure of accuracy is the mean squared error MSE = ‖β̂ − β‖². It is used to compare the estimates obtained for the probability sample and the extended sample. One can also compare the estimates of the two methods directly. Let β̂P denote the parameter estimate obtained from the probability sample, and β̂E the estimate obtained from the extended sample. A measure that compares the accuracy of these estimates is the relative mean squared error,

    MSEr = ‖β̂P − β‖² / ‖β̂E − β‖².

It indicates how much larger the mean squared error of the simple probability sample estimator is than that of the estimator obtained from the extended data set. It can also be computed for the reduced coefficient vector that contains the coefficients on the explanatory variables only. This seems sensible if one is interested only in the impact of the covariates. If, however, one aims at prediction, the whole vector of coefficients makes more sense. The choice is of minor importance since we found that the improvement obtained by the enlargement of the sample is very similar for both types of parameter vectors.
Further measures of performance refer to the identification of observations in the non-probability sample. As in classification problems one can consider

hits: the proportion of observations correctly identified as coming from the target population,

false positives: the proportion of observations incorrectly identified as coming from the target population.

A large number of hits means that the information in the non-probability sample is used efficiently, whereas a large number of false positives is an indicator that estimates might be distorted. In the ideal case all the information in the non-probability sample is used but no distortion of estimates is observed. It should be mentioned that the inclusion of observations from the polluted sample does not necessarily entail biased estimates, since only observations are included that do not change the parameters strongly.
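In a simulation, where the true origin of each non-probability observation is known, the two rates can be computed directly (an illustrative helper; the names are our own):

```python
import numpy as np

def identification_rates(included, is_target):
    """Hits and false positives of the selection step: `included` flags the
    non-probability observations added to the extended sample, `is_target`
    flags which of them truly stem from the target population."""
    included = np.asarray(included, dtype=bool)
    is_target = np.asarray(is_target, dtype=bool)
    hits = included[is_target].mean()               # correctly identified targets
    false_positives = included[~is_target].mean()   # polluted obs let through
    return hits, false_positives
```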

3 Some Simulations
3.1 One Predictor
For illustration we consider briefly the case of just one predictor with fixed parameters. In a small simulation study we compare the estimators based on the probability sample and the extended sample. The simulation uses n = 40, n1 = 200, n2 = 200. The distribution of the covariates and the parameters are specified in the following way.

Setting (a): Probability sample: normally distributed, µ0 = E(x) = 1, regression parameters: βᵀ = (β0, β1) = (1, 1), var(ε) = 1.
Polluting sample: normally distributed, E(x) = 2, regression parameters: βᵀ = (2, −1), var(ε) = 4.

Setting (b): Probability sample: normally distributed, µ0 = E(x) = 1, regression parameters: βᵀ = (β0, β1) = (1, 1), var(ε) = 1.
Polluting sample: normally distributed, E(x) = 2, regression parameters: βᵀ = (3, 0.5), var(ε) = 9.
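Setting (a) can be generated, for example, as follows (a sketch under the stated distributional assumptions; the seed and function names are our own):

```python
import numpy as np

def simulate_setting_a(n=40, n1=200, n2=200, seed=0):
    """One data set for setting (a): probability sample and target part S1
    from y = 1 + x + eps with x ~ N(1,1), var(eps) = 1; polluted part S2
    from y = 2 - x + eps with x ~ N(2,1), var(eps) = 4."""
    rng = np.random.default_rng(seed)
    def draw(m, mu_x, b0, b1, sd):
        x = rng.normal(mu_x, 1.0, m)
        return x, b0 + b1 * x + rng.normal(0.0, sd, m)
    s0 = draw(n, 1.0, 1.0, 1.0, 1.0)      # probability sample
    s1 = draw(n1, 1.0, 1.0, 1.0, 1.0)     # S1: target part of S^NP
    s2 = draw(n2, 2.0, 2.0, -1.0, 2.0)    # S2: polluted part, sd(eps) = 2
    return s0, s1, s2
```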

Figure 1 shows the results for αst = αch = 0.05 (setting (a)). The upper left picture shows the MSEs for the probability sample estimator (PSE, light) and the estimator based on the extended sample (ExtE, dark). It is seen that the MSE of the ExtE (dark blue) is distinctly smaller than the MSE of the PSE (light blue). This is also seen from the upper right picture, which shows the relative mean squared errors. The inclusion of observations from the non-probability sample yields strongly improved estimates. The second row shows the proportions of correctly and incorrectly identified observations in the 50 simulated data sets (left) and the observations of both variables for the first simulated data set (right). It is seen that the proportion of observations from the non-probability sample that are correctly identified as coming from the target population is always large; the proportion incorrectly identified as coming from the target population is between 0.20 and 0.35. But, as the distribution of the response and the predictor in the right picture shows, incorrectly identified observations do not necessarily affect the accuracy of the estimator; it is only essential to exclude all observations that might introduce bias. The last row shows the box plots of the estimates for the two parameters β0 and β1, with the horizontal line indicating the true value. It is seen that in particular the variability of the estimates is smaller for the extended sample, denoted by "regression extended".

Figure 1: Simulation with a univariate predictor (setting (a)); upper left picture: MSEs for the probability sample estimator (light) and the estimator based on the extended sample (dark); upper right picture: relative mean squared errors; second row, left: proportions of correctly and incorrectly identified observations; second row, right: observations for the first simulated data set; third row: box plots of the estimates for the two parameters β0 and β1, with the horizontal line indicating the true value.

Figure 2 shows the corresponding pictures for setting (b). Although the number of incorrectly identified observations is much larger than in setting (a), the improvement in terms of MSE is very strong. The reason is that the observations that are incorrectly included do not distort the estimate, since the slope in the polluted sample differs from the slope in the target population, but not as strongly as in setting (a), where the signs of the slopes differ. For further illustrations see the Appendix.

Figure 2: Simulation with a univariate predictor (setting (b)); upper left picture: MSEs for the probability sample estimator (light) and the estimator based on the extended sample (dark); upper right picture: relative mean squared errors; second row, left: proportions of correctly and incorrectly identified observations; second row, right: observations for the first simulated data set; third row: box plots of the estimates for the two parameters β0 and β1, with the horizontal line indicating the true value.

3.2 Multivariate Predictors

In the following we show the results of selected simulations for more than one explanatory variable (100 data sets). The regression parameters and the location of the explanatory variables in the polluted sample are not fixed but vary across data sets. The location of the explanatory variables in the probability sample is determined by the mean vector µ0ᵀ = E(x)ᵀ = (µ00, . . . , µ0p). In the polluted sample a normally distributed shift is added to each component, E(xi) = µ0i + σloc ∗ N(0, 1), with σloc determining the strength of the shift. In the same way the regression parameters in the polluted sample are determined as deviations from the parameters in the probability sample, βi = β0i + σpar ∗ N(0, 1), with σpar determining the strength of the shift. We consider the following settings.

Setting 1: Four predictors
Probability sample: normally distributed, predictors pairwise correlated with correlation coefficient 0.3, µ0ᵀ = E(x)ᵀ = (1, 1, 1, 1), regression parameters: βᵀ = (β0, . . . , β4) = (1, 1, 2, 3, 4), var(ε) = 1.
Polluting sample: normally distributed, predictors pairwise correlated with correlation coefficient 0.3, E(xi) = µ0i + σloc ∗ N(0, 1), σloc = 1, regression parameters: βi = β0i + σpar ∗ N(0, 1), σpar = 1, var(ε) = 2.

Setting 2: Same as setting 1 but with stronger variation in the polluting sample, σpar = 2 or σpar = 4.

Setting 3: Eight predictors
Probability sample: normally distributed, predictors pairwise correlated with correlation coefficient 0.3, µ0ᵀ = E(x)ᵀ = (1, . . . , 1), regression parameters: βᵀ = (β0, . . . , β8) = (1.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0), var(ε) = 1.
Polluting sample: normally distributed, predictors pairwise correlated with correlation coefficient 0.3, E(xi) = µ0i + σloc ∗ N(0, 1), σloc = 2, regression parameters: βi = β0i + σpar ∗ N(0, 1) with base vector β0 = (1.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0), σpar = 1, var(ε) = 4.

Figure 3 shows the mean squared errors for setting 1 with four explanatory variables and varying numbers of observations n1, n2, when the number of observations in the probability sample increases from 40 to 200. For simplicity we used αst = αch = 0.05; choice by cross-validation yielded very similar α values, and performance typically was comparable. It is seen that the MSEs for both the probability sample estimator (PSE) and the estimator based on the extended sample (ExtE) decrease with increasing sample sizes. The ExtE performs consistently better than the PSE. Only in the case where there is not much to gain from using the observations in the non-probability sample (n1 = 200) and there are many polluted observations (n2 = 800) does the dominance become weaker, in particular for large probability sample sizes.

Figure 3: MSEs for varying numbers of observations in the probability sample, setting 1; method 1 (light) is the estimate in the probability sample, method 2 (dark) in the extended sample.

The left picture in Figure 4 shows the comparison of estimates for varying
numbers of observations from the target population in the non-probability sample.
It is seen that the increased availability of observations from the target population
decreases the error in the extended sample. The right picture shows the effects
of varying numbers of observations in the polluted data sets. It is seen that
increasing the number of polluted observations hardly affects the performance of
the estimates in the extended sample.
Figure 4: MSEs for setting 1; left: varying numbers of observations from the target population in the non-probability sample (n = 40, n2 = 400); right: varying numbers of observations in the polluted data sets (n = 80, n1 = 200); method 1 is the estimate in the probability sample, method 2 in the extended sample.

Figure 5 shows the comparisons for varying numbers of observations in the probability sample for setting 2 with stronger variation in the polluting sample. The improvement of the estimates is similar to that seen in setting 1.
Figure 5: MSEs for varying numbers of observations in the probability sample, setting 2; left: σpar = 2, right: σpar = 4; method 1 is the estimate in the probability sample, method 2 in the extended sample.

Figure 6 shows the comparisons for varying numbers of observations in the probability sample for setting 3 with eight predictors and two settings of observation numbers. The results are very similar to the results seen for four predictors. In particular for small sample sizes of the probability sample the gain in accuracy when using the extended sample is very pronounced.
Figure 6: MSEs for varying numbers of observations in the probability sample, setting 3; left: n1 = 200, n2 = 200; right: n1 = 200, n2 = 600; method 1 is the estimate in the probability sample, method 2 in the extended sample.

4 Standard Errors
Standard errors for parameter estimates of linear models are easily obtained by using cov(β̂) = σ²(XᵀX)⁻¹ and replacing σ² by the estimator σ̂² = Σi (yi − xiᵀβ̂)² / (n − (p + 1)). When using the extended sample it is tempting to use the same estimator with the extended design matrix. However, this estimator is severely biased since it ignores the process that augments the data; it simply assumes that the extended data set is a probability sample. The larger number of observations creates the illusion that standard errors are much smaller than they actually are.
A way to obtain reliable estimates is bootstrapping (Efron and Tibshirani, 1994; Davison and Hinkley, 1997). For a given data set consisting of the probability sample and the non-probability sample, the whole estimation procedure is carried out repeatedly for data that are obtained by drawing observations from the data set with replacement. The variation of the estimates is used to compute realistic standard errors of the parameters.
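The resampling scheme can be sketched as follows; `fit_extended` stands for the complete enlargement-and-estimation procedure and is a hypothetical placeholder:

```python
import numpy as np

def bootstrap_se(X0, y0, Xnp, ynp, fit_extended, n_boot=100, seed=0):
    """Bootstrap standard errors for the extended-sample estimator: both
    samples are resampled with replacement and the whole enlargement and
    estimation procedure (`fit_extended`, a hypothetical placeholder) is
    rerun on every replicate."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        i0 = rng.integers(0, len(y0), len(y0))      # resample probability sample
        inp = rng.integers(0, len(ynp), len(ynp))   # resample non-probability sample
        estimates.append(fit_extended(X0[i0], y0[i0], Xnp[inp], ynp[inp]))
    return np.std(estimates, axis=0, ddof=1)        # one SE per coefficient
```

Rerunning the full procedure on every replicate, rather than merely refitting on the extended design matrix, is what lets the bootstrap reflect the variability introduced by the selection step.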
An approximation to the true standard errors is obtained by repeating the estimation procedure for nrep = 100 draws with fixed parameters. To obtain bootstrap standard errors, we used nboost = 100 bootstrap repetitions. We show the results for the following selected simulation settings.

Setting B1: Four predictors
Probability sample as in setting 1; in particular, µ0ᵀ = E(x)ᵀ = (1, 1, 1, 1), βᵀ = (1, 1, 2, 3, 4).
Polluting sample as in setting 1 but with E(x)ᵀ = (2, 2, 2, 2) and βᵀ = (1, 1, 2, 3, 4) + (1, −1, 1, −1) fixed.

Setting B2: Same as setting B1 but with stronger variation: probability sample var(ε) = 4 and non-probability sample var(ε) = 9.

Setting B3: Eight predictors
Probability sample as in setting 3 with µ0ᵀ = (1, . . . , 1), regression parameters βᵀ = (β0, . . . , β8) = (1.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0), var(ε) = 1.
Polluting sample: normally distributed, E(x)ᵀ = (3, . . . , 3), regression parameters β = (1.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0) + (0, 1, −1, 1, −1, 1, −1, 1, −1), var(ε) = 4.

Table 1 shows the resulting estimates. "Prob sample" denotes the standard error in the probability sample; "Non-prob naive" are the standard errors obtained as estimates when fitting in the extended sample while ignoring that the sample has been extended. They are only shown to demonstrate that these estimates should not be used. "Non-prob actual" denotes the approximated actual standard errors and "Non-prob boost" the bootstrap estimates. It is seen that the nominal standard errors "Non-prob naive" strongly underestimate the actual standard errors. The bootstrap errors are comparable to the actual errors but tend to be slightly larger, which makes them somewhat conservative when used in testing. Standard errors in the extended sample are smaller than the standard errors in the probability sample, which typically is also seen in the bootstrap estimates, although they are slightly larger than the actual errors. However, though the bootstrap standard errors are not much smaller than the standard errors in the probability sample, estimates in the extended sample are more precise, as has been demonstrated in the simulations in Section 3.

Table 1: Standard errors for selected settings

                                β0      β1      β2      β3      β4
Setting B1  Prob sample       0.252   0.133   0.133   0.162   0.163
            Non-prob naive    0.090   0.053   0.052   0.053   0.052
            Non-prob actual   0.191   0.102   0.111   0.124   0.115
            Non-prob boost    0.224   0.130   0.143   0.133   0.134
Setting B2  Prob sample       0.552   0.301   0.275   0.336   0.289
            Non-prob naive    0.175   0.100   0.098   0.099   0.099
            Non-prob actual   0.368   0.199   0.202   0.213   0.192
            Non-prob boost    0.436   0.267   0.275   0.284   0.258
Setting B3  Prob sample       0.269   0.164   0.152   0.158   0.166
            Non-prob naive    0.085   0.056   0.055   0.055   0.056
            Non-prob actual   0.205   0.107   0.106   0.109   0.115
            Non-prob boost    0.257   0.154   0.147   0.152   0.155

5 Illustrative Empirical Studies

5.1 Munich Rent Data
For illustration we use the Munich rent index data. The variables are rent (rent in Euros, metric), floor (floor space, metric), rooms (number of rooms, metric), and district (residential area, nominal). For an extensive description see Fahrmeir et al. (2013). The rent is the dependent variable; floor space and number of rooms are explanatory variables. The district serves as indicator for the samples. We consider a fixed district, for example "Inner City", as the probability sample and observations from other districts as samples from a different source. Part of the sample from other districts could be considered as observations from the target population.
Table 2 (upper part) shows the estimates for two districts, "Inner City" and "Schwabing", and the estimates obtained if other districts are used to enlarge the data set. For example, when considering "Inner City" as the probability sample one has 47 observations. If one considers "Bogenhausen" as the non-probability sample, the number of available observations is 206 (47 from "Inner City" and 159 from "Bogenhausen"). The extended sample size (used observations) when including the observations that can be considered as coming from the target population is 198. If the district "Schwabing" is considered the non-probability sample, the extended sample comprises 203 observations. It is seen that the standard errors (bootstrap for the enlarged sample) decrease distinctly by enlarging the data sets.
In the lower part of Table 2 the same combinations of probability and non-
probability samples are considered, but now we do not use the full probability
samples to estimate parameters. Since these are real data sets it is unknown
which estimates are better, the ones based on the probability sample or the ones
based on the extended samples. Estimates in the probability sample could suffer
from outliers. Therefore, for a fixed probability sample we consider which
observations would be included if the same sample were used as a non-probability
sample. Then we use only the observations that are considered as coming from the
target population to obtain a reduced probability sample. For the district
"Inner City", only 43 of the original 47 observations were included in the
reduced data set. It is seen that this can change the estimates distinctly. For
example, the estimate for floor, an important quantity since it gives the price
per square meter after accounting for the number of rooms, is 11.21 in the
original sample but 10.46 in the reduced sample, which makes a difference for
people renting flats.
The estimate in the reduced sample can be seen as a robustified estimator
since outliers are excluded. As an estimator of the standard error, the bootstrap
estimator is reported in this case, since the naive estimator is certainly biased.
Table 3 shows how the samples are reduced when outliers are omitted: 'kept'
refers to the number of observations retained in the sample and 'deleted' to the
number of observations left out by the robustified estimator.
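The bootstrap estimator mentioned above can be sketched as follows: both samples are resampled and the whole selection-and-fit step is repeated in each replicate. This is only an illustration under our own assumptions about the resampling scheme (the paper's adapted bootstrap may differ in detail); the function names and the residual cutoff are ours, and the design matrices are assumed to already include an intercept column.

```python
import numpy as np

def fit_enlarged(Xp, yp, Xn, yn, z_cut=1.96):
    # OLS on the probability sample, then add compatible non-prob observations
    beta, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    sigma = np.std(yp - Xp @ beta, ddof=Xp.shape[1])
    keep = np.abs(yn - Xn @ beta) / sigma <= z_cut
    X = np.vstack([Xp, Xn[keep]])
    y = np.concatenate([yp, yn[keep]])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def bootstrap_se(Xp, yp, Xn, yn, B=200, seed=1):
    """Standard errors for the enlarged-sample estimator: resample the two
    samples separately and redo selection and fitting in every replicate."""
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(B):
        ip = rng.integers(0, len(yp), len(yp))   # resample probability sample
        iq = rng.integers(0, len(yn), len(yn))   # resample non-prob sample
        boots.append(fit_enlarged(Xp[ip], yp[ip], Xn[iq], yn[iq]))
    return np.std(boots, axis=0, ddof=1)
```

Repeating the selection step inside each replicate is essential: it propagates the variability of the in-or-out decision into the standard errors, which a naive bootstrap of the fixed enlarged sample would miss.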

Table 2: Munich rent data for various districts and enlarged data sets obtained
by including neighborhood districts (standard errors obtained by bootstrap).

                            available  used            Parameters            Std err
                            obs        obs        area      rooms     std area  std rooms
Inner city                     47       47       11.21     -92.56       1.866     58.072
  enlarged by Bogenhausen     206      198       11.22     -64.87       1.569     43.937
  enlarged by Schwabing       212      203       10.21     -53.22       1.785     49.880
Schwabing                     165      165       11.16     -42.36       1.119     31.581
  enlarged by Maxvorstadt     333      324       10.40     -26.64       1.203     25.907
  enlarged by Bogenhausen     324      314       11.27     -46.03       1.271     24.801
Inner city reduced             47       43       10.46     -97.99       2.187     66.978
  enlarged by Bogenhausen     206      191       10.75     -69.64       1.413     42.952
  enlarged by Schwabing       212      194        9.87     -59.18       1.524     43.958
Schwabing reduced             165      151       10.11     -46.08       2.050     63.679
  enlarged by Maxvorstadt     333      319        9.62     -30.84       1.083     22.314
  enlarged by Bogenhausen     324      290       10.35     -44.74       1.111     22.838

Table 3: Munich rent data, numbers of observations kept and deleted when the
probability sample is also used as the non-probability sample.

              kept   deleted   proportion kept
Inner city      43         4             0.914
Schwabing      151        14             0.915

5.2 Fears Data


As a second data set we consider data from the German Longitudinal Election
Study (GLES), a long-term study of the German electoral process (Rattinger
et al., 2014). The data we use originate from the pre-election survey for the
German federal election in 2017 and consist of responses to various items
addressing political fears. The participants were asked: "How afraid are you due
to the use of nuclear energy?", with answers given on a Likert scale from 1 (not
afraid at all) to 7 (very afraid). The explanatory variables are age (in years),
gender (0: male, 1: female) and Abitur (certificate qualifying for university
admission, 1: yes, 0: no).
We consider data collected in a fixed federal state (in Germany called a Land),
for example Bavaria, as the probability sample and observations from other
states as samples from a different source. Table 4 shows some illustrative
examples. First we consider Bavaria as the target population and a state that is
very close to Bavaria, namely Baden-Württemberg (BW), as the additional data
set. In the next step we consider data from Baden-Württemberg and Hessen as
additional data. In a similar way we consider Berlin as the target population,
which is enlarged by using data from the neighbouring states Brandenburg
(Brand) and Saxony. It is seen that including neighbouring states reduces the
standard errors; however, the effect is less distinct if data from a second
state are included.
Table 5 shows fits for the fears data when robustified estimates are used. As
before, they are obtained by using data from a state as the probability sample
and the same data as the non-probability sample to obtain the reduced sample,
which includes only observations that are considered as coming from the target
population. Here we excluded the variable Abitur, which showed no significant
effects (Table 4). It seems that fits for fixed states tend to suffer from
outliers: the estimated parameters obtained when the data from a state are used
as the probability sample differ distinctly from the fits obtained when the data
set is reduced in a first step. But estimates remain rather constant if data
from neighbouring states are added to the reduced data set. For Bavaria the
standard errors decrease when adding observations, but they do not improve if
Berlin is considered the target population.

Table 4: Fears data for various federal states and enlarged data sets obtained
by including neighbouring states (standard errors obtained by bootstrap).

                                                     Parameters                Std err
                             available  used
                             obs        obs      Age  Gender  Abitur     Age  Gender  Abitur
Bavaria                         317      317   0.010   0.566  -0.294   0.005   0.193   0.209
  enlarged by BW                522      507   0.012   0.619  -0.273   0.004   0.166   0.206
  enlarged by BW, Hessen        837      795   0.014   0.600  -0.226   0.004   0.165   0.197
Berlin                           62       62   0.007   0.816   0.148   0.011   0.452   0.450
  enlarged by Brand             208      189   0.011   0.572   0.567   0.010   0.429   0.413
  enlarged by Brand, Saxony     380      341   0.014   0.703   0.246   0.010   0.435   0.404

Table 5: Fears data for various federal states and enlarged data sets obtained
by including neighbouring states, robustified estimates obtained by initially
reducing the probability sample.

                                                   Parameters           Std err
                             available  used
                             obs        obs      age   gender    std age  std gender
Bavaria all obs                 317      317   0.012    0.586      0.005       0.193
Bavaria reduced                 317      290   0.020    0.874      0.004       0.171
  enlarged by BW                495      464   0.023    0.850      0.003       0.168
  enlarged by BW, Hessen        632      583   0.024    0.878      0.003       0.166
Berlin all obs                   62       62   0.006    0.821      0.011       0.448
Berlin reduced                   62       55   0.009    0.730      0.009       0.363
  enlarged by Brand             201      152   0.009    0.801      0.009       0.389
  enlarged by Brand, Saxony     373      265   0.014    0.787      0.009       0.410

6 Concluding Remarks
It has been demonstrated that the inclusion of selected observations from a
different sample can strongly improve parameter estimation in regression
modelling. The model considered was the classical linear model with least
squares estimates, but the basic methodology, namely including observations
based on an investigation of residuals, can also be used with alternative
estimators. In regression with many explanatory variables, shrinkage and other
regularized estimators (Tibshirani, 1996; Hastie et al., 2009) are nowadays in
common use, and it is straightforward to use the residuals of these estimators
to select observations. The approach could also be used when fitting extended
models such as additive models, generalized additive models (Hastie and
Tibshirani, 1990; Wood, 2017) and generalized linear models (McCullagh and
Nelder, 1983; Fahrmeir and Tutz, 2001) by using appropriate residuals.
The method used here to enlarge the probability sample is an in-or-out
procedure. In future research one might alternatively compute weights based on
residuals and on the change in parameters, and put weights on the observations
in the extended sample: observations in the probability sample would get weight
1, while included observations would get smaller weights depending on the values
of the residual measures.
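As a speculative illustration of this weighting idea (it is a suggestion for future research, not part of the proposed method; the Gaussian weight function and all names are our choices), a weighted least squares variant might look as follows:

```python
import numpy as np

def weighted_fit(Xp, yp, Xn, yn):
    """Weighted least squares: probability-sample observations get weight 1,
    external observations get weights decreasing in their standardized
    residuals (illustrative choice: a Gaussian kernel). Design matrices are
    assumed to include an intercept column."""
    # pilot fit on the probability sample
    beta, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    sigma = np.std(yp - Xp @ beta, ddof=Xp.shape[1])
    # standardized residuals of the external observations
    z = (yn - Xn @ beta) / sigma
    # weight 1 for probability-sample rows, downweighted external rows
    w = np.concatenate([np.ones(len(yp)), np.exp(-0.5 * z**2)])
    X = np.vstack([Xp, Xn])
    y = np.concatenate([yp, yn])
    # weighted least squares via a rescaled design
    sw = np.sqrt(w)
    b, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return b
```

Compared with the in-or-out rule, weights vary smoothly with the residuals, so borderline observations contribute partially instead of being fully included or excluded.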

References
Baker, R., J. M. Brick, N. A. Bates, M. Battaglia, M. P. Couper, J. A. Dever, K. J.
Gile, and R. Tourangeau (2013). Summary report of the AAPOR task force on
non-probability sampling. Journal of Survey Statistics and Methodology 1(2),
90–143.
Bethlehem, J. (2016). Solving the nonresponse problem with sample matching?
Social Science Computer Review 34(1), 59–77.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York:
Springer-Verlag.
Brick, J. M. (2015). Compositional model inference. JSM Proceedings, Survey
Research Methods Section, 299–307.
Chen, Y., P. Li, and C. Wu (2020). Doubly robust inference with nonprobability
survey samples. Journal of the American Statistical Association 115(532),
2011–2021.
Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression.
London: Chapman & Hall.
Cornesse, C., A. G. Blom, D. Dutwin, J. A. Krosnick, E. D. De Leeuw, S. Legleye,
J. Pasek, D. Pennay, B. Phillips, J. W. Sakshaug, et al. (2020). A review of
conceptual approaches and empirical evidence on probability and nonprobability
sample survey research. Journal of Survey Statistics and Methodology 8(1),
4–36.
Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods and Their Application.
Cambridge: Cambridge University Press.
Deville, J.-C. (1991). A theory of quota surveys. Survey Methodology 17, 163–181.
Efron, B. and R. J. Tibshirani (1994). An Introduction to the Bootstrap,
Volume 57. CRC Press.
Elliott, M. R. and R. Valliant (2017). Inference for nonprobability samples.
Statistical Science 32(2), 249–264.
Fahrmeir, L., T. Kneib, S. Lang, and B. Marx (2013). Regression: Models, Methods
and Applications. Berlin: Springer.
Fahrmeir, L. and G. Tutz (2001). Multivariate Statistical Modelling Based on
Generalized Linear Models. New York: Springer.
Hastie, T. and R. Tibshirani (1990). Generalized Additive Models. London:
Chapman & Hall.
Hastie, T., R. Tibshirani, and J. H. Friedman (2009). The Elements of Statistical
Learning (Second Edition). New York: Springer-Verlag.
Kim, J. K., S. Park, Y. Chen, and C. Wu (2021). Combining non-probability and
probability survey samples through mass imputation. Journal of the Royal
Statistical Society: Series A (Statistics in Society) 184(3), 941–963.
Lee, S. and R. Valliant (2009). Estimation for volunteer panel web surveys using
propensity score adjustment and calibration adjustment. Sociological Methods
& Research 37(3), 319–343.
McCullagh, P. and J. A. Nelder (1983). Generalized Linear Models. New York:
Chapman & Hall.
Rattinger, H., S. Roßteutscher, R. Schmitt-Beck, B. Weßels, and C. Wolf (2014).
Pre-election Cross Section (GLES 2013). GESIS Data Archive, Cologne, ZA5700,
Data File Version 2.0.0.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society B 58, 267–288.
Vavreck, L. and D. Rivers (2008). The 2006 Cooperative Congressional Election
Study. Journal of Elections, Public Opinion and Parties 18(4), 355–366.
Wiśniowski, A., J. W. Sakshaug, D. A. Perez Ruiz, and A. G. Blom (2020).
Integrating probability and nonprobability samples for survey inference.
Journal of Survey Statistics and Methodology 8(1), 120–147.
Wood, S. (2017). Generalized Additive Models: An Introduction with R (2nd ed.).
Chapman and Hall/CRC.
Appendix
Illustration Univariate Predictor
Settings (a) and (b) were considered in the text; here we consider one additional
setting:

Setting (c): Probability sample: x normally distributed with E(x) = 1,
regression parameters β^T = (β0, β1) = (1, 1), var(ε) = 1.
Polluting sample: x normally distributed with E(x) = 2, regression parameters
β^T = (3, −2), var(ε) = 4.

Figure 7 shows the results.


Figure 7: Simulation results for the univariate predictor, setting (c), with
n1 = 200, n2 = 200. The panels show the relative norms and MSEs of the estimates
for the two methods, the rates of correctly and incorrectly identified
observations across simulations, the data (groups NP1, NP2, P), and the
parameter estimates (variables 0 and 1) for the probability sample and the
extended regression.
Further Simulations Multivariate Predictor
We consider again Setting 1 with four predictors, but now the noise in the
probability sample is larger than in the polluted sample: var(ε) = 4 in the
probability sample and var(ε) = 1 in the polluted sample. In this case the
choice αst = αch = 0.05 turns out to perform less convincingly. Therefore we
used cross-validation to obtain the results shown in Figure 8. The average
values of α (αst = αch) chosen by cross-validation for the five settings were
0.155, 0.141, 0.203, 0.190, 0.193. Choosing by cross-validation from the full
grid of α values 0.3, 0.2, 0.1, 0.05 showed no further improvement.
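The cross-validation step for choosing α can be sketched as follows. The mapping of α to a two-sided normal cutoff, the squared-error criterion on held-out probability-sample folds, and all function names are our assumptions about how such a tuning step could be implemented; design matrices are assumed to include an intercept column.

```python
import numpy as np

# two-sided standard-normal cutoffs z_{1 - alpha/2} for the grid of alphas
Z_CUT = {0.3: 1.036, 0.2: 1.282, 0.1: 1.645, 0.05: 1.960}

def _fit_enlarged(Xp, yp, Xn, yn, z_cut):
    # OLS on the probability part, then add compatible external observations
    beta, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    sigma = np.std(yp - Xp @ beta, ddof=Xp.shape[1])
    keep = np.abs(yn - Xn @ beta) / sigma <= z_cut
    X = np.vstack([Xp, Xn[keep]])
    y = np.concatenate([yp, yn[keep]])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def cv_alpha(Xp, yp, Xn, yn, grid=(0.3, 0.2, 0.1, 0.05), K=5, seed=2):
    """Choose alpha by K-fold cross-validation on the probability sample:
    enlarge the training folds, score squared prediction error on the
    held-out probability-sample fold."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(yp)) % K        # random fold assignment
    scores = {}
    for a in grid:
        err = 0.0
        for k in range(K):
            tr, te = folds != k, folds == k
            b = _fit_enlarged(Xp[tr], yp[tr], Xn, yn, Z_CUT[a])
            err += np.sum((yp[te] - Xp[te] @ b) ** 2)
        scores[a] = err / len(yp)
    return min(scores, key=scores.get), scores
```

Scoring only on held-out probability-sample observations is the natural choice here, since they are the only observations known to stem from the target population.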
Figure 8: Simulation setting 1 with noise in the probability sample larger than
in the polluted sample, tuning parameters chosen by cross-validation. The panels
show the MSE for the two methods as a function of the number of observations in
the probability sample (40 to 200), for n1 = 200, n2 = 200 (left) and n1 = 200,
n2 = 400 (right).
