Ratio Imputation Improvement

Master's Thesis

Author: Agnese Giacomello (Student Number: 6078249)
Supervisor: Dr. Jeroen Pannekoek
Abstract
When dealing with survey data, individual values are in practice frequently missing. The classi-
cal approach for business data in Official Statistics is to replace missing values based on a ratio
imputation model, which involves a single predictor variable. This paper explores extensions of
the classical method, with regard to both the number of predictors involved and the
base model. In particular, the robust Huber and Tukey estimators and the machine-learning tech-
nique boosting have been applied to the case of interest. Boosted and non-boosted models based
on three different methods (including one, two and all predictors) have been compared, and RMSE
and MAPE have been used to assess imputation accuracy. Non-boosted Tukey estimation with all
variables as predictors resulted in the most precise imputation model; at the individual level, Tukey
imputation resulted in an average decrease in MAPE of 48.98%, compared to the classical method.
1 Introduction
Missing data are an ongoing problem in real data research. The presence of missing values can
be due to a variety of factors, related both to the way in which data are collected and to the
information contained in the data itself. As an example, missing observations are more likely to
occur on private or sensitive topics, as in the case of income-related surveys [4]. To ensure valid
research, missing values need to be properly handled in the analysis process. Statistical methods
can be used to infer the information lost, so as to limit the damage that the missingness produces on the final analysis.
In the past thirty years, thorough research has been conducted with the goal of finding methods
that lead to unbiased estimates of the missing information. Little and Rubin [13] had a major
impact on the development of the currently used techniques. They provided the basis of present-day
missing data processing, which can be broadly summarised into two categories: case deletion
and imputation methods [15]. The former, which consists of the deletion of cases in which
some information is missing, can often result in estimation bias, in particular with small sample
sizes [9]. Imputation, which dates back at least to 1966 [14], aims at inferring the most likely
value for the variable of interest by using the amount of information contained in the non-missing
values. In the presence of item nonresponse, where only some observations of a single respondent
are missing, the common approach is to replace missing values by imputation procedures [8, 2].
Imputation methods have evolved into a number of different approaches, one of which is the focus
of this thesis: ratio imputation [18]. In particular, for business data in Official Statistics, ratio
imputation is the most commonly used method to deal with item nonresponse [20]. Reasons for
that lie within the kind of data involved, which are usually non-negative (e.g. costs, turnover,
etc.) and proportional to a measure of size of the business itself (e.g. number of employees). The
ratio model here presents a number of advantages: (i) it does not involve an intercept, therefore
protecting against the prediction of negative values that would not fit business data; (ii) it accommodates
the variance-proportional-to-mean behaviour usually found in business variables (i.e. larger
firms have larger errors); (iii) its simple model structure allows for an easy and direct interpretation
of results. As will be further explained in section 2, the classical ratio imputation method is based
on the ratio of the means of a single predictor and an output variable. With a focus on business
data in Official Statistics, the aim of this thesis is to find a ratio imputation procedure which leads to a higher imputation accuracy than the classical method.
2 Theory
This section outlines the different models that have been used for the imputation of missing values.
The ratio model, which involves a single predictor variable, is the classical baseline model from which the improvements are developed.
The first improvement entails the robust estimation of this model, leading to a robust ratio
model. The Huber [7] and Tukey [21] robust models have been considered, which involve the use of loss functions that down-weight large residuals.
The second line of improvement involves models with multiple predictors. In this regard,
the machine-learning technique boosting, which allows for the inclusion of multiple predictors
sequentially, has been applied to the case of interest. In particular, the following models have been
investigated: (1) extension of the ratio model to multiple predictors (Huber and Tukey robust linear
regression models); (2) Generalized Additive Models (GAMs), which can capture non-linear
patterns in the data; (3) Regression Trees, which do not require knowledge about the functional
form of the data. The boosted versions of the above-mentioned models have been investigated, where
at each iteration one of the predictors enters the model, depending on its current predictive ability.
2.1 The ratio model

In ratio imputation, a single predictor variable x is used in the prediction of a target variable y,
where the target variable represents the outcome of interest. In the ratio model, the ratio between
target and auxiliary variable is assumed to be approximately constant and is captured by the
quantity R. In the classical ratio model, R is estimated as the ratio between the means of y and x
computed on the units for which both auxiliary and target values are available, i.e. R̂ = ȳ/x̄. Given
a dataset with N units, the missing y-values can therefore be imputed following the ratio model as
in Equation 1:

\hat{y}_i = \hat{R}\, x_i. \qquad (1)
To explain how the above-presented R̂ estimator is derived, the notion of loss function needs to
be introduced. The loss function L_w is the weighted sum of squares of the observed residuals and,
as such, the objective is its minimisation. How residuals are weighted defines the different types of
estimators: in the classical ratio model, which is based on weighted least squares (wls) estimation,
the variance is assumed to be proportional to the mean, or equivalently, to the predictor variable.
The wls loss function therefore weights the sum of squares of the residuals proportionally to the
predictor:

L_w = \frac{1}{2} \sum_{i=1}^{N} \left( \frac{e_i}{\sqrt{\sigma^2 x_i}} \right)^2, \qquad (2)

where e_i = y_i − \hat{R} x_i, for i = 1, ..., N, are the model residuals. The minimisation of this quantity,
achieved by setting its derivative with respect to R̂ to zero, leads to the wls estimator R̂ = ȳ/x̄.
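Writing out this step explicitly (a one-line check, using the definitions above):

\frac{\partial L_w}{\partial \hat{R}} = -\frac{1}{\sigma^2} \sum_{i=1}^{N} \frac{(y_i - \hat{R} x_i)\, x_i}{x_i} = -\frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \hat{R} x_i) = 0 \;\Longrightarrow\; \hat{R} = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} = \frac{\bar{y}}{\bar{x}}.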
Although this ratio estimator is simple and intuitive, it involves only a single predictor and it
does not prevent the influence of outliers in the estimation process. This, together with the waste
of information when multiple target variables are available, may result in biased estimates.
2.2 Robust estimation

The classical wls method works well for normally distributed residuals. However, this condition
is often not met (for example, in the presence of outliers), which can result in an erroneous influence
on the ratio estimate and hence a lower imputation accuracy. Robust statistics reduce the
outliers' influence by down-weighting their impact in the estimation process. In this context, out-
liers are considered observations with high residuals from the fitted ratio model in Equation 1 [10].
Two robust estimators have been investigated: Huber [7] and Tukey [21]. These estimators
achieve their robustness by modifying the way residuals are weighted, and consequently their loss
function. To get an insight into how these estimators reduce the influence of large residuals compared to
the classical wls estimator, their loss functions are reported in Table 1.
The choice of the tuning constant k regulates the amount of robustness of these estimators:
smaller values of k lead to more resistance to outliers, and vice-versa [22]. A suggested choice is
to pick k = 1.345s for the Huber and k = 4.685s for the Tukey one [10], where s is the standard
deviation of the residuals. This choice has been shown to be optimal for symmetrical distributions.
In Figure 1, a visualisation of both the Huber and Tukey loss functions can be found, based on these suggested tuning constants.
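To make the estimators concrete, the following is a minimal sketch in R on synthetic data; all names and values are illustrative, and the rlm() defaults correspond to the tuning constants suggested above:

library(MASS)  # provides rlm() for robust M-estimation

set.seed(1)
x <- rexp(200, rate = 0.01)                 # size measure, e.g. number of employees
y <- 30 * x + rnorm(200, sd = 5 * sqrt(x))  # ratio model, variance proportional to x

R_wls <- mean(y) / mean(x)                  # classical wls ratio estimator

# robust ratio models: no intercept, inverse-variance weights 1/x
fit_huber <- rlm(y ~ x - 1, weights = 1 / x, wt.method = "inv.var",
                 psi = psi.huber, k = 1.345)
fit_tukey <- rlm(y ~ x - 1, weights = 1 / x, wt.method = "inv.var",
                 psi = psi.bisquare, c = 4.685)
c(wls = R_wls, huber = coef(fit_huber), tukey = coef(fit_tukey))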
2.3 Models with multiple predictors

To extend the robust ratio model, a linear regression model with multiple predictors and without
an intercept has been considered, based on both the Huber and Tukey estimators. Model estimation is
carried out by minimising the corresponding robust loss function over all coefficients.
Generalized Additive Models

GAMs (Generalized Additive Models) are a class of models where, in contrast to linear regression,
the relationship between output and predictor variables is defined by additive smooth functions.
By including smooth functions (e.g. splines, step functions or higher-order polynomials), GAMs
aim at capturing the non-linear patterns in the data. On p predictor variables x_i (with i = 1, ..., p),
the model takes the additive form

y = f_1(x_1) + f_2(x_2) + \dots + f_p(x_p) + \varepsilon, \qquad (3)

where (f_1, ..., f_p) are p smooth functions, each of which can take a different form.
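As an illustration, such a model can be fitted in R with, e.g., the mgcv package; the data frame d and the columns y, x1, x2 and the frame d_new below are hypothetical placeholders:

library(mgcv)  # gam() with spline smooth terms

fit_gam <- gam(y ~ s(x1) + s(x2), data = d)  # one smooth function per predictor
predict(fit_gam, newdata = d_new)            # predictions for new units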
Regression Trees
An often used non-parametric method for predicting a numerical target variable is the regression
tree [19]. In regression trees, estimation is carried out based on some decision rules, which de-
fine the segmentation of the predictor space in a predefined number of regions. Regression trees
are very simple to implement and lead to easily interpretable predictions, reasons for which they
have gained large popularity in prediction and classification models. An important parameter of
regression trees is the tree depth D, which defines the maximum depth of the tree and therefore its
complexity. This quantity corresponds to the number of nodes along the longest path from the root
node, where the root node has a depth of 0. An example of a regression tree with the corresponding node depths is shown in Figure 2.
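For illustration, a depth-limited regression tree can be grown in R with the rpart package, which likewise counts the root node as depth 0; d and d_test are placeholder data frames:

library(rpart)

fit_tree <- rpart(y ~ ., data = d,                       # tree with maximum depth D = 3
                  control = rpart.control(maxdepth = 3))
predict(fit_tree, newdata = d_test)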
2.4 Boosting
The term 'boosting' refers to a family of step-wise machine learning algorithms that, given a
target variable and a set of predictors, sequentially optimise a model's performance by
adding predictors one at a time. The idea is to combine a set of weak learners (predictors that
predict marginally better than random) into a strong learner by ensembling their predictive
abilities. The power of boosting lies in the fact that weak learners are simple and fast to implement,
allowing for a highly precise model with little computational effort.
A large number of boosting models have been developed in the literature, and they differ from each
other mostly in the way the weak learners are included in the iterative process and in how the
models are combined to obtain the final one. Almost any statistical technique with tuning parameters
can be transformed into a weak learner [11], leading to a large variety of boosting methods. The
common feature among them is that they are all based on a defined set of weak learners, with the
goal of combining them into a single strong learner.
Through boosting, it is possible to include information from more than one predictor while
keeping a relatively simple model structure. The high accuracy that boosting achieves [17] and
the possibility to include multiple predictors make it an appealing method to improve ratio impu-
tation’s accuracy.
Gradient boosting
The gradient boosting model was introduced by Friedman in 2001 [5]. The basic principle of
gradient boosting is to find the best additive model by optimization of the loss function, which
corresponds to the model's squared errors. At each iteration, a weak learner is fitted on the residuals
of the current model and is chosen in a way that leads to the minimisation of the loss function.
The fitted weak learner is then added to the current model, and the procedure is iterated.
More specifically, given a set of p predictors (x1 , ..., x p ) and a target variable y, boosting
achieves the optimal prediction of y given x by minimization of some specified loss function
L(y, F(x)):

\hat{F} = \arg\min_F \frac{1}{n} \sum_{i=1}^{n} L(y_i, F(x_i)) = \arg\min_F \mathbb{E}\big[ L(y, F(x)) \big], \qquad (4)
where L(y, F (x)) is assumed to be differentiable with respect to F(x) and F(x) reflects the
choice of the base learner (e.g., regression tree). An example of the loss function for the wls, Huber and Tukey estimators is given in Table 1.
As specified in Equation 5, at each iteration m the model will have the form:

F_{m+1}(x) = F_m(x) + \lambda\, h_m(x), \qquad (5)

where:

• h_m(x) is the weak learner fitted at iteration m;

• \lambda is the learning rate (0 < \lambda ≤ 1), which defines how quickly the model is updated (smaller values give slower, more conservative updates).
The decision on the number of iterations M depends on the kind of data and problem at hand.
Most programs that implement boosting offer an option to determine the optimal m_stop iteration
(the one that leads to the best model in terms of accuracy), where 1 ≤ m_stop ≤ M. To derive this
value, cross-validation or the AIC (Akaike Information Criterion [1]) are used internally.
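For instance, with the mboost R package (the componentwise boosting implementation of Bühlmann and Hothorn [3]), this tuning step could be sketched as follows; d_train is a placeholder data set:

library(mboost)  # componentwise gradient boosting

# boosted linear model: M = 100 iterations, learning rate nu = 0.1
fit <- glmboost(y ~ ., data = d_train,
                control = boost_control(mstop = 100, nu = 0.1))
cvr <- cvrisk(fit)  # cross-validated risk at each iteration
mstop(cvr)          # the optimal stopping iteration m_stop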
The choice of the base learners defines the model itself, as will be further explained in the
next sections. Common choices are, for example, regression trees, linear models or additive linear
models. As explained by Bühlmann and Hothorn [3], the reasons behind the choice of base learn-
ers can be based on the structural properties of the boosting estimates and/or on maximising predictive performance.
Considering the focus of my thesis, I will first illustrate a gradient boosting algorithm with robust linear regression as base learner.
Applied to the ratio model, on a dataset with n units the optimal prediction of y given p predictors (x_1, ..., x_p) is obtained as follows:

1. select the learning rate \lambda and the number of iterations M;

2. initialize the model with a constant value, for example the average observed response ȳ, i.e. F_0(x) = ȳ;

3. for m = 0 to M:

(a) compute the current residuals r_m = y − F_m(x);

(b) for each candidate predictor, fit a robust regression model h(x) to the residuals r_m
and pick the predictor x_k (with k = (1, ..., p)) that minimises the loss function. The
minimisation of the loss function is obtained by computing its negative first derivative
with respect to F(x_k), i.e. −[∂L(y_i, F(x_{i,k})) / ∂F(x_{i,k})], for i = (1, ..., n);

(c) update the model:

F_{m+1}(x) = F_m(x) + \lambda\, h_m(x_k) = F_m(x) + \lambda\, \hat{R} x_k, \quad with k = (1, ..., p), \qquad (6)

where R̂ is the coefficient estimated from the robust model fitted in the previous step.
This model will have residuals r_{m+1} = y − F_{m+1}(x), which will be used in the next iteration.
As in normal robust regression, predictors that enter the model will have a specific coefficient:
the difference lies in the fact that in a boosted model, a predictor is included only if at some
iteration m it was the one leading to the lowest residual (in step 3b). It follows that the same
predictor can enter the model multiple times, and that the earlier a variable enters the model, the
higher its influence on the final estimate of the target variable (at each iteration the algorithm is
improving its performance, therefore residuals will be smaller and smaller). At the final iteration
M, each predictor x_k (with k = (1, ..., p)) will have a specific coefficient, which is defined as:

\hat{R}_k = \sum_{m=1}^{M} \lambda \cdot \hat{R}_k^{[m]} \cdot \mathbb{1}\{ k^{[m]} = k \}, \qquad (8)

where:

• \hat{R}_k^{[m]} is the robust regression coefficient estimated at iteration m;

• \mathbb{1}\{ k^{[m]} = k \} is an indicator that equals one if predictor x_k was the one selected at iteration m, and zero otherwise.

It follows, logically, that the coefficient of a variable is updated only at the iterations in which
this variable was included, i.e. at iterations where this variable led to the minimisation of the loss
function.
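To make the algorithm concrete, the following is a minimal R sketch of the boosting loop above with Huber base learners; it is a simplified illustration rather than the exact implementation used in this thesis, and the weighted-residual loss in step 3b is one possible choice:

library(MASS)  # rlm() provides the robust base learners

boost_robust <- function(X, y, M = 100, lambda = 0.1) {
  p  <- ncol(X)
  Fm <- rep(mean(y), length(y))   # step 2: F_0(x) = y-bar
  R  <- numeric(p)                # accumulated coefficient per predictor (Eq. 8)
  for (m in seq_len(M)) {
    r    <- y - Fm                # step 3a: current residuals
    fits <- lapply(seq_len(p),    # step 3b: one no-intercept robust fit per predictor
                   function(k) rlm(r ~ X[, k] - 1, psi = psi.huber))
    loss <- vapply(fits, function(f) sum(f$w * residuals(f)^2), numeric(1))
    k    <- which.min(loss)       # predictor minimising the weighted loss
    Rk   <- coef(fits[[k]])
    Fm   <- Fm + lambda * Rk * X[, k]  # step 3c: update as in Equation 6
    R[k] <- R[k] + lambda * Rk         # coefficient accumulation, Equation 8
  }
  list(coefficients = R, fitted = Fm)
}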
If regression trees were to be used as base learner in place of robust linear regression, in item 3b
a regression tree would have to be fitted to the residuals. This gradient boosting algorithm has
consequently three tuning parameters: the number of iterations M and the learning rate l , as
defined in the previous section, and the tree depth D, as defined in section 2.3.
Gradient boosting where F (x) represents a regression tree leads to the following algorithm:
1. Select the maximum tree depth, D, the learning rate l and the number of iterations M;
2. Compute the average observed response, ȳ, and use it as the initial predicted value;
3. for m=0 to M:
(a) Compute the residual rm (the difference between the observed and the current predicted
values);
(b) Using the residuals rm as the response, fit a regression tree hm (x) of depth D for each
predictor separately, and pick the one that minimises the loss function;
(c) Update the predicted value by adding l hm (x) to the predicted values generated in the
previous step;
It can be noticed that the steps of the above algorithm correspond to the ones in the boosted
robust linear regression, as they are examples of gradient boosting based on different base learners.
Hence, the choice of base learners influences the way in which predictions are computed, but not the overall structure of the algorithm.
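Off-the-shelf implementations of this tree-based variant exist as well; a hedged sketch with the gbm R package, where d_train and d_test are placeholder data frames, maps the tuning parameters onto the corresponding arguments:

library(gbm)

# M -> n.trees, lambda -> shrinkage; interaction.depth plays the role of
# the tree-complexity parameter D
fit_gbm <- gbm(y ~ ., data = d_train, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, interaction.depth = 3)
predict(fit_gbm, newdata = d_test, n.trees = 100)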
3 Methods
In this section, the data and estimation methods are described. The focus is on understanding both
which model leads to the best imputation accuracy and whether the inclusion of multiple variables improves it.
3.1 Data

Analyses are based on two datasets. The first dataset is from the Dutch Structural Business statistics, where
274 units are recorded on six variables: Turnover, Number of employees, Cost of
purchases, Depreciation cost, Cost of employees and Other costs. Turnover and Number of employees
are usually considered predictor variables, because they express a complementary measure
of size of the business and are known from other sources (e.g., tax registers, chamber of commerce registers).
The second dataset is from the Dutch Healthcare statistics, where information on 841 healthcare institutions is recorded.
are recorded. The kind of variables included is similar to the previous dataset, and they consist of
Turnover, Number of employees, Depreciation cost, Total costs, Wage salaries, Pension costs and
Other costs.
As is often the case with business data, all the above considered variables are right skewed and
strictly positive. A number of outliers was present in both datasets, and analyses were based both
on the complete dataset and on a subset of the data where the most obvious outliers were excluded
(five in the Dutch Structural Business dataset and one in the Dutch Healthcare dataset). Results for the two versions of the data were then compared.
3.2 Ratio models

A ratio model was fitted to the data for every combination of target and predictor variables. In each
model, the ratio R was estimated based on the wls, Huber and Tukey estimators.
To assess the estimators' accuracy on out-of-sample predictions, data were divided into 80%
training data and 20% test data, where the test data served as induced missing values that were
used to test the accuracy of the fitted models. Namely, the true (known) values were hidden during
the model fitting, and used only in the determination of the model’s accuracy. The classical ratio
model was then fitted to the train data only, and used to predict the ’missing’ test data. Predictions
were computed following Equation 1, and the final model error for each estimator and variables’
combination was obtained by computing the accuracy measures presented in subsection 3.6.
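A sketch of such a split with the caret package (the variable names are illustrative):

library(caret)

set.seed(42)  # the same split is reused for every model
idx     <- createDataPartition(d$Other_costs, p = 0.8, list = FALSE)
d_train <- d[idx, ]   # 80% used for model fitting
d_test  <- d[-idx, ]  # 20% treated as induced missing values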
3.3 Boosted models

To assess how boosting affects imputation accuracy, four different boosted models have been fitted to the
two datasets:

• Generalized Additive Models (GAMs);
• Regression Trees;
• Huber robust linear regression;
• Tukey robust linear regression.
For both datasets, each of the target variables (i.e. all variables excluding Turnover and Num-
ber of employees) was used as dependent variable. To understand the influence of the number of
predictor variables on the model performance, three different approaches have been implemented
and compared. Namely, the above four models have been fitted, for each dependent variable,
to: (i) one predictor only (Number of employees); (ii) both the predictors (Number of employees
and Turnover); (iii) all the available variables (four for the first dataset and five for the second one).
To enhance the comparability of results and reduce the influence of randomness, the same train-test
split has been used in the estimation of all the different models, through the caret R package
[12]. More information on this package can be found in Appendix A, together with a simple hands-on
model tuning example. Considering the three approaches and the number of variables involved,
the number of fitted models amounted to 48 for the first dataset (4 target variables × 4 models × 3
methods) and 60 (5 target variables × 4 models × 3 methods) for the second one. This corresponds
to fitting 12 models to each target variable, for a total of 108 models. Boosting was performed over
100 iterations for each of the models, and results were compared based on the accuracy measures presented in subsection 3.6.
In each model, variable importance was assessed: a measure based on the order and
the number of times each predictor entered the boosting algorithm. Considering the nature of
boosting (as explained in subsection 2.4), the first variable that enters the model is the one that
best minimises the loss function, i.e. the one that is best able to predict the corresponding
output variable. Therefore, variable importance is a function of the reduction in squared error,
also called risk reduction. In particular, the decrease in squared error for each predictor is summed
up at each boosting iteration (i.e., at each iteration, each predictor gets a variable importance
value). For each variable, these values are then averaged across the M iterations to generate an overall variable importance value.
The self-frequency values were also inspected, which are the percentages representing the
relative number of times a variable is included in the boosting model (at each of the M iterations, one
variable is included). In a given model, the self-frequencies sum up to one, and the variable with
the highest self-frequency is not necessarily the one with the highest variable importance value:
for example, a predictor which is included only once at an earlier model iteration might bring
a higher improvement compared to another predictor which is included multiple times at later
stages. This is easily explained by the nature of boosting, where model’s residuals are sequentially
reduced: as the number of iterations increases, residuals will be smaller and smaller. Therefore,
if a variable is included multiple times at later stages, its influence on residuals’ reduction will be
lower, as residuals have already been reduced at all the previous iterations.
3.4 Non-boosted robust models

Given the focus on robust estimation and boosting, with the goal of comparing models to their
boosted versions, an extension of the classical ratio imputation model was introduced. Following
the methods of the previous section, a non-boosted ratio imputation model was fitted with (i) one
predictor only (Number of employees), which corresponds to the models fitted in subsection 3.2; (ii)
both predictors (Number of employees and Turnover); (iii) all the available variables (four for the
first dataset and five for the second one). For each target variable, six more models were therefore fitted (three for
Huber estimation and three for Tukey estimation). Analyses were again based on the same train-test
data division, and as in the classical ratio model, the intercepts were excluded.
The boosted robust models were then compared to their non-boosted versions to assess whether
the higher model complexity induced by boosting has an impact on the models’ predictive ability.
Also, an overall accuracy comparison among boosted and non-boosted models was performed.
Given the large number of models, a visual comparison of the 18 models (12 boosted models and
6 non-boosted) was created, where the change in accuracy across the three methods (one, two or
all variables as predictors) is displayed.
To check whether models with (1) one predictor compared to models with two predictors and
(2) all variables as predictors compared to models with two predictors led to a consistent accuracy
improvement, the percentage decrease of the RMSE (Root Mean Squared Error) and the MAPE (Mean
Absolute Percentage Error) has been investigated on the model that performed best overall. A
10% difference in accuracy has been chosen as the threshold to consider an improvement effective.
3.5 Comparison at the aggregate level

The focus of Statistics Netherlands is often on the estimation of means and totals of the variables
of interest, where missing values are present. In this regard, the methods proposed in this paper
have been evaluated and compared to the standard ratio imputation procedure used by National
Statistics Institutes, with the goal of finding a method that leads to a post-imputation mean as
close as possible to the real population one. The standard among National Statistics Institutes is
to impute values based on a classical ratio model with either Turnover or Number of employees as
predictor variable.
The true (observed) mean of each target variable has been compared with the post-imputation
mean obtained with (1) classical ratio imputation; (2) the newly proposed method (non-boosted Tukey estimation with all variables as predictors).
The true (observed) mean has been computed on the complete dataset (i.e., before the exclusion
of the test data), while the means for the classical imputation and the Tukey robust methods were
computed on the complete data where the actual test data values were replaced by the predictions
made by the corresponding models. Accuracy at the aggregate level has been based on the MAPE.
3.6 Accuracy measures

The predicted values were compared to the (known) true y-values to assess the models' performance. Two accuracy measures have been used:

- The Mean Absolute Percentage Error (MAPE), which provides a standardized accuracy
measure in terms of percentages. Lower values correspond to better predictions. It is defined as:

\text{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{|y_i|} \times 100, \qquad (9)

where y_i is the observed value, \hat{y}_i the corresponding predicted value and N the number of predicted units.

- The Root Mean Squared Error (RMSE), which corresponds to the standard deviation of the prediction errors:

\text{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }. \qquad (10)

It measures the average magnitude of the error, and the squared term causes high errors to have a
relatively larger influence on the final measure.
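Both measures translate directly into two small R helper functions (a straightforward transcription of Equations 9 and 10):

# y: observed test values; yhat: the corresponding model predictions
mape <- function(y, yhat) mean(abs(y - yhat) / abs(y)) * 100
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))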
4 Results
Results will be compared, and estimation accuracy will be assessed, as presented in section 3: (1)
between the classical ratio imputation method and the use of robust statistics with one predictor
variable; (2) on the four boosted models, where each model has been fitted with one, two, or all
variables as predictors; (3) on the boosted and non-boosted robust models, where the non-boosted
models have likewise been fitted with one, two and all predictors.
4.1 Classical and robust ratio imputation

As introduced in subsection 3.1, the standard among Statistical Offices is to use either Turnover
or Number of employees as predictor variable in the classical ratio model. For all models that
involved only one predictor variable, results are here presented for the predictor variable Number
of employees; results obtained with the predictor variable Turnover are comparable and have
therefore been omitted. Also, detailed results are reported for the target variable Other
costs from the Dutch Structural Business dataset; detailed numerical results for the other variables can be found in Appendix B.
At first, the wls, Huber and Tukey estimators were used to derive the ratio (R) estimates on the
complete data, based on the loss functions presented in Table 1. Results are summarized
in Table 2 for the predictor variable Number of employees and the target variables in the Dutch
Structural Business dataset.
Subsequently, the same models were fitted to the training data, while the test data were used
as 'induced' missing values to assess the models' accuracy. The MAPE and RMSE were used to
assess accuracy among the different estimators, based on the comparison between the predicted
and actual target values. Table 3 includes accuracy results (where lower values indicate a lower
prediction error) for the target variables in the Dutch Structural Business dataset and the predictor
variable Number of employees. Models’ accuracy is very dependent on the predictive ability of
the predictor variable with respect to the target one: for example, it is clear that Number of em-
ployees is not a good predictor for the target variable Cost of purchases, while it is a good one for
Depreciation cost, leading to imputed values that are ≈ 1.4% off the real ones.
Considering both datasets, robust estimation with one predictor variable led to a decrease in
MAPE compared to the wls method in seven target variables out of nine (≈ 77%), with Tukey
estimation performing slightly better than the Huber one most of the time. When considering the
Tukey model and the predictor variable Number of employees, the average improvement over all
nine target variables confirmed this pattern.
4.2 Boosting
As explained in subsection 3.3, for each of the nine target variables accuracy results were compared
across three methods: the simplest one, where only one predictor variable is included; the second
one, where both predictor variables enter the model; and the third one, which comprises all predictor
and target variables. Boosted GAMs, Regression Trees, and Huber and Tukey robust models
were compared. The accuracy measures for each target variable are reported in Appendix B. Considering
all target variables, only in two cases out of nine (≈ 22%) did a GAM with two predictors
perform better than the other models. In all the other cases, the inclusion of all variables as
predictors brought an improved imputation accuracy. In particular, the boosted GAM including all
variables outperformed all other models three times out of nine (≈ 33%), while for the remaining
four target variables the best imputation model proved to be the boosted Huber (≈ 44%).
Variable importance was assessed to understand which predictors brought the highest con-
tribution to the boosted models. In particular, the focus was on understanding whether the two
predictors that are usually used in ratio imputation models (Turnover and Number of employees)
are actually the ones that had the major impact in the boosting algorithms. On the boosted-Huber
models, for each of the nine target variables the variable importance values were inspected: Turnover
appeared among the two variables with the highest contribution five times out of nine, while
Number of employees only one time out of nine (for the target variable Cost of employees, as the
two variables measure closely related quantities). It is noted that the two predictor variables never
appeared together as the two best predictive inputs. This suggests, as confirmed by the
other results, that the inclusion of multiple predictors can be of great importance for an increased
model accuracy. An example of the variable importance values for the target variable Other costs
is reported in Figure 3, where both the variables' inclusion in terms of frequency (self-frequency) and
their contribution in terms of risk reduction are displayed.
In this case, the variables Cost of employees and Depreciation cost proved to be considerably
better predictors than the standard predictor variables Number of employees and Turnover. Cost
of employees was included in the boosted model 47% of the times (i.e., in 47% of the M
iterations) and contributed 73% of the risk reduction, while Depreciation cost was included
in 19% of the iterations and contributed 29% of the risk reduction. On the other hand, Turnover
was included in the boosted model 11% of the times, contributing only 0.03% of the
risk reduction; and even though Number of employees was included 17% of the times, its
contribution in terms of risk reduction was null (≈ 0%). This example makes clear that the self-frequency
of a variable is not directly related to its contribution in terms of risk reduction, as
a frequently selected variable may still bring little improvement.
4.3 Boosted versus non-boosted models

To determine whether the increased complexity of the boosted models conveyed a satisfactory increase
in accuracy, these have been compared to the non-boosted robust Huber and Tukey models.
In seven cases out of nine (≈ 77%), robust non-boosted estimation proved superior to all the
boosted methods in terms of both lowest RMSE and lowest MAPE. In particular, the Tukey estimator with
all variables as predictors resulted in the lowest imputation errors six times out of nine (≈ 66%).
Consistently across all nine target variables (100%), the use of robust estimation (boosted or
non-boosted) resulted in the lowest imputation errors. This implies that in the five cases where
boosted GAMs outperformed the other boosted models (see subsection 4.2), non-boosted robust
models proved superior to the GAMs in all cases. Figure 4 provides a graphical representation
of how the different models affect imputation accuracy for the target variable Other costs, where it is
clear that a sharp improvement in MAPE has been obtained with the inclusion of all variables in
non-boosted Tukey estimation. Even though in some models the inclusion of multiple predictors
resulted in a decreased accuracy compared to the same model with one or two predictors (e.g., the
GAM in this example), the model that performed best overall has been shown to always
include all predictors, for all of the target variables. Detailed numerical results for the other eight
target variables are reported in Appendix B.
On the whole, non-boosted Tukey estimation with all variables as predictors proved to be the
model with the highest accuracy and is therefore the suggested imputation model. On the nine
target variables, the MAPE for this model ranged between 0.14 and 4.47, with a mean of 1.74.
This implies that, on average, this model will lead to individual imputed values that are 1.74% off compared to the real ones.
As introduced in subsection 3.4, the percentage decrease in RMSE and MAPE for the overall
best model (non-boosted Tukey) has been investigated between models with (1) one compared to
two predictors; (2) two compared to all predictors. The inclusion of two predictors compared to
the one-predictor models resulted in an average percentage decrease of 19.39% in RMSE and of
27.46% in MAPE. Furthermore, the inclusion of all variables compared to models with
only two predictors led, on average, to a 42.05% decrease in RMSE and a 43.69% decrease in
MAPE. Hence, the considered comparisons largely exceeded the 10% threshold above which an
improvement was considered effective.
In Table 4, an overview of the accuracy comparison based on the MAPE between the classical
wls model and the new method is presented. Only in one variable out of nine (≈ 11%) did the new
method not result in an improved prediction accuracy. For all other variables, the improvement
ranges between 12.20% and 96.97%. The overall mean accuracy improvement, based on the
MAPE, is 48.98%.
The comparison of the post-imputation means between the classical and the newly proposed
method (non-boosted Tukey with all predictors) is reported in Table 5 for each target variable.
Only in four target variables out of nine (≈ 44%) did the new method result in a more precise
post-imputation mean. The post-imputation means obtained with the new method presented a
percentage difference (MAPE) compared to the true (observed) means that ranged between 0.03%
and 4.16%, with an average of 1.13%. On average, the post-imputation means obtained with this
method were therefore not more precise than those obtained with the classical procedure.
5 Discussion

This paper investigated novel methods to achieve a more precise ratio imputation of missing business
data. The standard in National Statistical Offices is to impute missing values based on a ratio
model with one predictor variable (Turnover or Number of employees), where wls estimation is
usually performed.
At first, classical and robust techniques have been compared and discussed. Table 2 shows how
the use of robust statistics impacts the ratio estimate when only one predictor variable is involved.
Imputation performance suggests that the robust ratio models perform better than the classical
ratio estimator in 77% of the cases, although the improvement in MAPE brought by robust estimation with a single predictor is relatively modest.
Next, the machine-learning technique boosting has been applied to the case of interest, allowing for
a step-wise inclusion of multiple predictors. Boosted GAMs, Regression Trees, and Tukey and Huber
models have been compared for each of the nine target variables separately, where each model has
been fitted with one, two and all variables as predictors, with the goal of assessing how the inclusion
of multiple variables affects imputation accuracy. Boosted models have also been compared to the
non-boosted robust models (Huber and Tukey) with one, two and all predictors. The most precise
model proved to be the non-boosted Tukey with all variables in six target variables out of nine
(≈ 66%), the non-boosted Huber in one target variable (≈ 11%, where the Huber estimates were
very close to the Tukey ones in all cases) and the boosted Huber in the remaining two variables
(≈ 22%). It follows that robust estimation (boosted or non-boosted) with the inclusion of all
predictors outperformed the other methods for all the considered target variables.
On the whole, non-boosted Tukey estimation led to the best model and is therefore suggested as
a future method for business missing data imputation. This model leads to individual imputed values
that are, on average, 1.74% off compared to the real ones. When all predictors were included,
this model led to an average percentage decrease of 42.05% in RMSE and of 43.69% in
MAPE, compared to the same models with the two standard predictors only. The present findings
confirm that to achieve a higher imputation performance, the inclusion of information from
other available variables is of definite importance. Even though the initial choice of predictors
was based on practical reasons (see subsection 3.1) and on the standard choice among Statistical
Offices, it is reasonable that the inclusion of as much information as is available enables the model
to catch multiple aspects of the data, leading to an improved imputation model. Also, the inspection
of variable importance in the boosted models showed that the two standard predictors never
appeared together as the two most important variables.
The newly proposed method, when compared to the classical ratio model where either Number of
employees or Turnover are used as predictor variables, brought on average an increased accuracy
in MAPE of 48.98%. Hence, the individual prediction of missing values can greatly benefit from this new method.
Considering that the focus of National Statistical Institutes is often on means and totals of the
target variables of interest, in the end the newly proposed method has also been compared to the
standard ratio model at the aggregate level. When Number of employees was used as predictor
variable, only in ≈ 44% of the cases did the new method lead to a more precise post-imputation mean.
It is worth noting the differences between results at the individual level and those at the
aggregate level: the very high improvement in accuracy obtained with Tukey imputation is not reflected
in the accuracy of the same method at the aggregate level. This might be due to the fact that
the inclusion of multiple predictors helps the model to catch multiple and very precise aspects of
the data at the individual level (where each business is considered as a unit), while at the aggregate
level a single variable reflecting the size of the businesses provides a simpler, yet more effective, measure for imputation.
Future research could assess whether the present findings can be extended to different kinds
of data. Also, as described in subsection 2.2, in robust estimation the proposed tuning constants k
have been shown to be optimal for symmetrical distributions [10]. Considering the skewed nature
of the examined data, future research should consider a different procedure for the determination
of the tuning constants. Lastly, it is noted that all the boosted models have been fitted for a fixed
number of iterations (M = 100); as explained in section 2.4, the maximum boosting performance
is obtained when the model is run for the optimal number of iterations m_stop, usually obtained
through cross-validation. Although this process is implemented in the caret R package, it is limited
to the Gaussian family. It can be of great importance for future studies to develop a procedure
that extends the automatic choice of the best m_stop value to non-Gaussian families and in particular,
considering the promising results obtained with robust statistics, to the Huber and Tukey families.
Finally, future research could assess whether the great improvements obtained at the individual level can be extended to other levels of aggregation.
In conclusion, this paper provides evidence of a method that effectively improves imputation
accuracy of business missing data at the individual level. When detailed information is the focus
(e.g., modeling, individual predictions, etc.), the new method brings on average a 48.98% accu-
racy improvement (based on the MAPE) compared to the standard method. On the other hand, at
higher levels of aggregation (i.e., estimation of the means or totals), the classical wls imputation
method is suggested, for its simplicity and its greater accuracy at the aggregate level.
Appendices
Appendix A
The caret (classification and regression training) package [12] was created by Max Kuhn in 2008,
with the goal of having a comprehensive tool for complex regression and classification models
in R [16]. Kuhn developed the package based on the available R packages for prediction, so
that a single interface could serve a large number of models based on the same syntax. The idea
was born to solve the inconsistencies among the various packages that perform predictive models:
different tuning parameters settings and slightly different prediction methods lead to confusion
and difficult model comparison. The caret package solved these issues by providing a unified
interface and syntax, as well as increased computational efficiency through the use of parallel pro-
cessing (the computational process is split into several simultaneous processes that run in parallel).
On a variety of predefined models (112, including the most used lm, glm, etc.), caret supports:
• data splitting and pre-processing (centering, scaling, imputation with different methods,
etc.);
– the function trainControl allows the user to define the resampling method (e.g., bootstrapping,
cross-validation, repeated cross-validation, etc.) to be used during the model fitting
process;
– the function train enables the user to fit each of the 112 models based on a unified
syntax;
– to fit the model on a specified set of tuning parameters, these can be provided via the
expand.grid command;
– the user can specify the preferred accuracy measure(s) to assess model performance,
which will be evaluated at each different combination of the tuning parameters; the
best-performing combination is then retained.
• predictions can be obtained through the predict function, which (unless specified differently)
automatically uses the combination of model hyperparameters that led to the best
accuracy measure.
It follows that caret's model fitting procedure makes it easy to check whether, e.g., a boosted
tree model performs better with a maximum tree depth of 2, 4 or 10. Also, straightforward plot-
ting options are available to visually inspect how models’ accuracy is influenced by the different
combinations of tuning parameters. The influence in accuracy can be compared both within and
between models: from one side, as just mentioned, different combinations of tuning parameters
can be easily compared on the same model. On the other hand, different models run on the same
dataset can likewise be compared. Another very attractive feature of this package is that it allows the user to
compare multiple models based on the same resampling profiles. For instance, when k-fold cross-
validation is implemented, the data sub-samples can be set to be exactly the same among different
models. In this way, the randomness induced by the resampling procedure plays a smaller role in
the determination of which model performs best, leading to a clear and reliable model comparison.
The comprehensiveness of the caret package, together with its other attractive features and the
consistent way in which models are assessed, makes it a useful guide for the researcher in
model choice.
LISTING A.1: Example of a boosted regression tree model fitting with caret
grid <- expand.grid(shrinkage = seq(0.1, 1, by = 0.4),
                    interaction.depth = c(2, 4, 10),
                    n.trees = c(100, 300, 500, 1000))
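For completeness, a hypothetical continuation of Listing A.1 shows how such a grid is passed to trainControl()/train(); d_train, d_test and the model formula are placeholders, and recent caret versions additionally expect n.minobsinnode in a gbm tuning grid:

library(caret)

ctrl <- trainControl(method = "cv", number = 10)      # 10-fold cross-validation
fit  <- train(y ~ ., data = d_train, method = "gbm",
              trControl = ctrl, tuneGrid = grid, verbose = FALSE)
predict(fit, newdata = d_test)  # uses the best tuning combination found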
Appendix B
Accuracy measures in terms of RMSE and MAPE are here presented for each model and each method (one predictor, two predictors and all variables as predictors).
Results for the target variables in the Dutch Healthcare dataset are presented in Table B.1, Ta-
ble B.3, Table B.2, Table B.4 and Table B.5, while results for the target variables in the Dutch
Structural Business dataset are presented in Table B.6, Table B.7, Table B.8 and Table B.9.
TABLE B.1: Imputation error assessment for the target variable Wage salaries
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 616.15 546.27 297.99 2.60 1.79 1.21
Regression Tree* 671.93 704.56 461.17 2.87 2.22 1.57
Huber* 621.65 533.48 381.53 2.45 1.77 1.45
Tukey* 635.82 525.85 451.09 2.61 1.87 1.76
Huber 631.92 546.60 139.68 2.45 1.81 0.63
Tukey 621.61 500.90 125.20 2.44 1.77 0.58
* Models that have been boosted.
TABLE B.2: Imputation error assessment for the target variable Other costs
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 476.68 370.55 321.40 1.90 1.37 1.18
Regression Tree* 471.13 326.51 276.62 1.99 1.28 1.20
Huber* 531.21 386.13 361.06 2.00 1.37 1.22
Tukey* 552.77 417.37 414.15 2.04 1.42 1.39
Huber 527.76 386.26 162.75 1.98 1.39 0.71
Tukey 534.98 395.54 135.31 1.99 1.39 0.66
* Models that have been boosted.
TABLE B.3: Imputation error assessment for the target variable Pension costs
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 58.05 64.01 51.96 0.24 0.21 0.15
Regression Tree* 60.71 71.50 68.51 0.29 0.27 0.21
Huber* 54.62 63.19 51.38 0.22 0.21 0.14
Tukey* 54.06 58.99 50.40 0.23 0.20 0.16
Huber 55.69 61.20 49.04 0.22 0.21 0.13
Tukey 54.47 51.97 48.63 0.22 0.18 0.14
* Models that have been boosted.
TABLE B.4: Imputation error assessment for the target variable Total costs
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 1073.19 917.71 647.37 4.91 2.10 1.55
Regression Tree* 1159.11 1030.65 684.04 5.41 2.50 1.88
Huber* 1133.68 856.86 374.16 4.72 2.11 1.25
Tukey* 1188.17 856.63 669.72 5.06 2.76 2.77
Huber 1131.21 894.77 173.21 4.69 2.15 0.72
Tukey 1129.77 855.63 139.68 4.69 2.13 0.64
* Models that have been boosted.
TABLE B.5: Imputation error assessment for the target variable Cost deprecia-
tions
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 93.85 91.47 91.06 0.42 0.38 0.38
Regression Tree* 92.92 90.61 90.76 0.42 0.39 0.39
Huber* 92.72 83.96 82.90 0.41 0.35 0.35
Tukey* 93.56 85.13 85.60 0.41 0.35 0.36
Huber 93.39 83.92 90.41 0.41 0.36 0.37
Tukey 95.04 85.66 87.47 0.41 0.35 0.36
* Models that have been boosted.
TABLE B.6: Imputation error assessment for the target variable Other costs
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 1566.72 1556.33 1906.90 9.40 8.58 9.22
Regression Tree* 1753.75 1852.52 1881.12 10.29 10.65 9.53
Huber* 1947.34 2014.20 1988.69 9.23 9.22 8.77
Tukey* 2015.31 2023.71 2053.70 9.19 9.07 8.83
Huber 1855.51 1987.76 973.39 8.08 8.59 4.56
Tukey 1838.57 1912.61 705.31 8.02 8.23 4.06
* Models that have been boosted.
TABLE B.7: Imputation error assessment for the target variable Cost purchases
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 35739.16 1881.65 1493.52 161.15 12.73 9.89
Regression Tree* 35933.39 12399.51 12399.51 166.54 37.94 37.94
Huber* 36058.81 5360.92 5353.10 152.72 24.10 23.22
Tukey* 36262.97 8256.81 8256.81 151.52 32.07 32.07
Huber 36800.83 3137.41 679.59 144.86 14.23 4.47
Tukey 37049.29 3578.34 690.35 146.17 15.42 4.47
* Models that have been boosted.
TABLE B.8: Imputation error assessment for the target variable Cost employees
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 706.89 666.94 721.20 4.57 4.35 4.54
Regression Tree* 783.23 818.89 792.63 5.29 5.54 5.34
Huber* 714.69 777.34 627.07 5.44 5.03 4.59
Tukey* 733.68 792.64 672.45 5.38 5.20 4.92
Huber 647.90 720.59 474.88 4.57 4.37 3.42
Tukey 696.31 815.45 486.43 4.36 4.42 3.39
* Models that have been boosted.
TABLE B.9: Imputation error assessment for the target variable Cost deprecia-
tions
                                     RMSE                               MAPE
Model                      One pred.  Two preds.  All vars.   One pred.  Two preds.  All vars.
Generalized Additive Model* 234.64 250.34 243.31 1.51 1.56 1.49
Regression Tree* 225.63 235.03 238.00 1.54 1.59 1.58
Huber* 225.35 225.35 216.07 1.49 1.49 1.39
Tukey* 229.70 229.70 221.00 1.49 1.49 1.40
Huber 225.35 225.35 216.07 1.49 1.49 1.39
Tukey 229.70 229.70 221.00 1.49 1.49 1.40
* Models that have been boosted.
Bibliography
[1] Akaike, H. (1998). A New Look at the Statistical Model Identification, pp. 215–222. New York, NY: Springer.
[2] Allison, P. D. (2002). Missing Data. Quantitative Applications in the Social Sciences. Thousand Oaks, CA, USA: Sage Publications, Inc.
[3] Bühlmann, P. and T. Hothorn (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science 22(4), 477–505.
[4] de Waal, T., J. Pannekoek, and S. Scholtus (2011). Handbook of Statistical Data Editing and Imputation. Wiley.
[5] Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5), 1189–1232.
[6] Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4), 367–378.
[7] Huber, P. J. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics 35(1), 73–101.
[8] Kalton, G. and D. Kasprzyk (1986). The treatment of missing survey data. Survey Methodology 12(1), 1–16.
[9] Kang, H. (2013). The prevention and handling of the missing data. Korean Journal of Anesthesiology 64(5), 402–406.
[10] Kelly, G. E. (1992). Robust Regression Estimators — The Choice of Tuning Constants. The Statistician 41(3), 303–314.
[12] Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software 28(5), 1–26.
[13] Little, R. J. A. and D. B. Rubin (1986). Statistical Analysis with Missing Data. New York, NY: John Wiley & Sons.
[14] Palmer, S. and C. Jones (1966). A look at alternate imputation procedures for CPS noninterviews. U.S. Bureau of the Census.
[15] Qin, Y., S. Zhang, X. Zhu, J. Zhang, and C. Zhang (2007). Semi-parametric optimization for missing data imputation. Applied Intelligence 27(1), 79–88.
[16] R Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
[17] Schapire, R. E. (1999). A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence — Volume 2, IJCAI'99, San Francisco, CA, USA: Morgan Kaufmann, pp. 1401–1406.
[18] Shao, J. (2000). Cold deck and ratio imputation. Survey Methodology 26(1), 79–85.
[19] Song, Y.-Y. and Y. Lu (2015). Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry 27(2), 130–135.
[20] Takahashi, M. (2017). Implementing multiple ratio imputation by the EMB algorithm. Journal of Modern Applied Statistical Methods 16(1).
[21] Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, MA: Addison-Wesley.
[22] Wang, Y.-G. and Z. D. Bai (2007). Robust estimation using the Huber function with a data-dependent tuning constant. Journal of Computational and Graphical Statistics 16(2), 468–481.
TABLE 1: Loss functions for the wls, Huber and Tukey estimators.

Estimator   Loss function
wls         L_w(e) = (1/2) (e / √(σ² x))²
Huber       L_H(e) = (1/2) e²                       if |e| ≤ k
            L_H(e) = k|e| − (1/2) k²                if |e| > k
Tukey       L_T(e) = k² {1 − [1 − (e/k)²]³} / 6     if |e| ≤ k
            L_T(e) = k² / 6                         if |e| > k

Where e = y − f(x) are the model residuals and k is a predefined constant.
[FIGURE 2: Example of a regression tree, with node depths 0 to 3.]
TABLE 2: Estimated ratio R based on the wls, Huber and Tukey estimators, for the
predictor variable Number of employees (Dutch Structural Business dataset).

Target variable        wls       Huber     Tukey
Cost of purchases      37.86     37.65     37.29
Cost of employees      0.87      0.83      0.82
Depreciation cost      0.0070    0.0072    0.0076
Other costs            33.01     29.01     27.31
TABLE 3: Imputation accuracy (RMSE and MAPE) of the wls, Huber and Tukey estimators,
predictor variable Number of employees (Dutch Structural Business dataset).

                              RMSE                                   MAPE
Target variable        wls        Huber      Tukey           wls      Huber    Tukey
Cost of purchases      36310.29   36800.83   37049.29        147.55   144.86   146.17
Cost of employees      643.72     647.90     676.31          4.42     4.56     4.36
Depreciation cost      222.23     228.90     231.60          1.39     1.40     1.42
Other costs            1493.98    1855.51    1838.57         7.83     8.08     8.01
[Figure: horizontal bar chart of risk reduction (0 to 1) per variable, with selection frequencies:
Cost of employees, sel. freq. ≈ 0.47; Depreciation cost, sel. freq. ≈ 0.19; Number of employees,
sel. freq. ≈ 0.17; Turnover, sel. freq. ≈ 0.11; Cost of purchases, sel. freq. ≈ 0.06.]
FIGURE 3: Variable importance plot for the target variable Other costs (Dutch
Structural Business dataset), reporting risk reduction and self-frequency values.
[Figure: two panels (RMSE, left; MAPE, right) showing accuracy with one predictor, two predictors
and all variables, for the models Boosted GAM, Boosted Regression Tree, Boosted Huber,
Boosted Tukey, Huber and Tukey.]
FIGURE 4: Accuracy results on the target variable Other costs (Dutch Structural
Business dataset)
TABLE 4: Accuracy comparison based on the MAPE between the classical ratio
imputation (a) and the Tukey robust imputation (b), with the resulting accuracy
improvement (%) per variable.

Variable    MAPE: classical ratio imputation (a)    MAPE: Tukey robust imputation (b)    Accuracy improvement (%)
TABLE 5: Change in target variable’s mean after imputation and MAPE based
on Tukey imputation. Bold values indicate the post-imputation mean closest to
the real one.