
UTRECHT UNIVERSITY

METHODOLOGY AND STATISTICS FOR THE BIOMEDICAL, BEHAVIOURAL AND SOCIAL SCIENCES

MASTER'S THESIS

Improvements of classical ratio-imputation methods using robust statistics and machine learning techniques

Author: Agnese GIACOMELLO (Student Number: 6078249)
Supervisor: Dr. Jeroen PANNEKOEK

May 16, 2019



Abstract

When dealing with survey data, individual values are in practice frequently missing. The classical approach for business data in Official Statistics is to replace missing values based on a ratio imputation model, which involves a single predictor variable. This paper explores extensions of the classical method, with regard to both the number of predictors involved and the base model. In particular, the robust Huber and Tukey estimators and the machine-learning technique boosting have been applied to the case of interest. Boosted and non-boosted models based on three different methods (including one, two and all predictors) have been compared, and the RMSE and MAPE have been used to assess imputation accuracy. Non-boosted Tukey estimation with all variables as predictors resulted in the most accurate imputation model; at the individual level, Tukey imputation yielded an average decrease in MAPE of 48.98% compared to the classical method.

1 Introduction

Missing data are an ongoing problem in real data research. The presence of missing values can

be due to a variety of factors, related both to the way in which data are collected and to the

information contained in the data itself. As an example, missing observations are more likely to

occur on private or sensitive topics, as in the case of income-related surveys [4]. To ensure valid research, missing values need to be properly handled in the analysis process. Statistical methods can be used to infer the lost information, so as to limit the damage that the missingness produces on the research output.

In the past thirty years, extensive research has been conducted with the goal of finding methods that lead to unbiased estimates of the missing information. Little and Rubin [13] had a major impact on the development of the currently used techniques. They provided the basis of present-day missing-data processing, which can be broadly summarised in two categories: case deletion and imputation methods [15]. The former, which consists of the deletion of cases in which

some information is missing, can often result in estimation bias, in particular with small sample

sizes [9]. Imputation, which dates back at least to 1966 [14], aims at inferring the most likely

value for the variable of interest by using the information contained in the non-missing

values. In the presence of item nonresponse, where only some observations of a single respondent

are missing, the common approach is to replace missing values by imputation procedures [8, 2].

Imputation methods have evolved into a number of different approaches, one of which, ratio imputation [18], is the focus of this thesis. In particular, for business data in Official Statistics, ratio

imputation is the most commonly used method to deal with item nonresponse [20]. The reasons for this lie in the kind of data involved, which are usually non-negative (e.g. costs, turnover, etc.) and proportional to a measure of the size of the business itself (e.g. number of employees). The ratio model presents a number of advantages here: (i) it does not involve an intercept, therefore protecting against the prediction of negative values that would not fit business data; (ii) it accommodates the variance-proportional-to-the-mean characteristic usually found in business variables (i.e. larger firms have larger errors); (iii) its simple model structure allows for an easy and direct interpretation

of results. As will be further explained in section 2, the classical ratio imputation method is based

on the ratio of the means of a single predictor and an output variable. With a focus on business

data in Official Statistics, the aim of this thesis is to find a ratio imputation procedure which leads

to an increased imputation accuracy.

2 Theory

This section outlines the different models that have been used for the imputation of missing values.

The ratio model, which involves a single predictor variable, is the classic baseline model for which

improvements are investigated.

The first improvement entails the robust estimation of this model, leading to a robust ratio

model. The Huber [7] and Tukey [21] robust models have been considered, which involve the use

of a single predictor while offering protection against outliers.

The second line of improvement involves models with multiple predictors. In this regard,

the machine-learning technique boosting, which allows for the inclusion of multiple predictors

sequentially, has been applied to the case of interest. In particular, the following models have been

investigated: (1) extension of the ratio model to multiple predictors (Huber and Tukey robust linear

regression models); (2) Generalized Additive Models (GAMs), which can capture non-linear

patterns in the data; (3) Regression Trees, which do not require knowledge about the functional

form of the data. The boosted version of the above-mentioned models has been investigated, where

at each iteration one of the predictors enters the model, depending on its current predictive ability

with respect to the dependent variable.

2.1 Classical ratio model

In ratio imputation, a single predictor variable x is used in the prediction of a target variable y,

where the target variable represents the outcome of interest. In the ratio model, the ratio between the target and the auxiliary variable is assumed to be approximately constant, and it is captured by the quantity R. In the classical ratio model, R is estimated as the ratio between the means of y and x

computed on the units for which both auxiliary and target values are available, i.e. R̂ = ȳ/x̄. Given

a dataset with N units, the missing y-values can therefore be imputed following the ratio model as

in Equation 1:

$$\tilde{y}_i = \hat{R}\, x_i, \qquad (1)$$

for $i = 1, \ldots, N$.

To explain how the above-presented R̂ estimator is derived, the notion of loss function needs to

be introduced. The loss function $L_w$ is the weighted sum of squares of the observed residuals, and the objective is its minimisation. How residuals are weighted defines the different types of

estimators: in the classical ratio model, which is based on weighted least squares (wls) estimation,

the variance is assumed to be proportional to the mean, or equivalently, to the predictor variable.

The wls loss function therefore weights the sum of squares of the residuals proportionally to the

reciprocal observed variance, leading to the function in Equation 2:

$$L_w = \frac{1}{2} \sum_{i=1}^{N} \left( \frac{e_i}{\sqrt{\sigma^2 x_i}} \right)^2, \qquad (2)$$

where $e_i = y_i - \hat{R} x_i$, for $i = 1, \ldots, N$, are the model residuals. The minimisation of this quantity, achieved by setting its derivative with respect to $\hat{R}$ to zero, leads to the wls estimator $\hat{R} = \bar{y}/\bar{x}$.
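As a concrete illustration, the estimator and Equation 1 translate directly into a few lines of R (a minimal sketch; the data frame d and columns x and y are assumed names, with missing y-values coded as NA):

# Classical ratio imputation: estimate R-hat on the complete cases,
# then impute the missing y-values following Equation 1.
ratio_impute <- function(d) {
  obs <- !is.na(d$y)
  R_hat <- mean(d$y[obs]) / mean(d$x[obs])  # R-hat = ybar / xbar
  d$y[!obs] <- R_hat * d$x[!obs]            # y~_i = R-hat * x_i
  d
}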

Although this ratio estimator is simple and intuitive, it involves only a single predictor and it

does not prevent the influence of outliers in the estimation process. This, together with the waste

of information when multiple target variables are available, may result in biased estimates.

2.2 Ratio models with robust estimation

The classical wls method works well for normally-distributed residuals. However, this condition

is often not met (for example, in the presence of outliers) and can result in an erroneous influence

on the ratio estimate, leading to lower imputation accuracy. Robust statistics reduce the influence of outliers by down-weighting their impact in the estimation process. In this context, outliers are considered observations with high residuals from the fitted ratio model in Equation 1 [10].

Two robust estimators have been investigated: Huber [7] and Tukey [21]. These estimators

achieve their robustness by modifying the way residuals are weighted, and consequently their loss

function. To get an insight in how these estimators reduce the influence of residuals compared to

wls estimation, their loss function can be inspected in Table 1.

The choice of the tuning constant k regulates the amount of robustness of these estimators:

smaller values of $k$ lead to more resistance to outliers, and vice versa [22]. A suggested choice is to pick $k = 1.345\sigma$ for the Huber estimator and $k = 4.685\sigma$ for the Tukey one [10], where $\sigma$ is the standard deviation of the residuals. This choice has been shown to be optimal for symmetrical distributions.

In Figure 1, a visualisation of both the Huber and Tukey loss function can be found, which is based

on the k values suggested by Kelly [10].
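In R, such robust ratio models can be fitted, for example, with MASS::rlm (an assumed implementation choice; the thesis does not specify its code). The weights 1/x encode the variance-proportional-to-x assumption of the ratio model, and the intercept is excluded as before:

library(MASS)  # rlm: M-estimation for linear models

# Robust ratio models without intercept; d, x and y are assumed names.
fit_huber <- rlm(y ~ x - 1, data = d, weights = 1 / d$x,
                 psi = psi.huber, k = 1.345)      # Huber tuning constant
fit_tukey <- rlm(y ~ x - 1, data = d, weights = 1 / d$x,
                 psi = psi.bisquare, c = 4.685)   # Tukey bisquare constant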

2.3 Models with multiple predictors

Robust linear models

To extend the robust ratio model, a linear regression model with multiple predictors and without

an intercept has been considered, based on both Huber and Tukey estimators. Model estimation is

performed by minimising the robust loss-functions in Table 1.

Generalised Additive Models

GAMs (Generalized Additive Models) are a class of models where, in contrast to linear regression,

the relationship between output and predictor variables is defined by additive smooth functions.

By including smooth functions (e.g. splines, step functions or higher order polynomials), GAMs

aim at capturing the non-linear patterns in the data. For $p$ predictor variables $x_i$ (with $i = 1, \ldots, p$), a GAM has the form:

$$y = \alpha + f_1(x_1) + \ldots + f_p(x_p) + \varepsilon, \qquad (3)$$

where $f_1, \ldots, f_p$ are $p$ smooth functions, each of which can take a different form.
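For illustration, a model of this form can be fitted in R with, e.g., the mgcv package (an assumed choice; the column names x1 and x2 are placeholders):

library(mgcv)

# One smooth function per predictor, as in Equation 3
gam_fit <- gam(y ~ s(x1) + s(x2), data = d)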

Regression Trees

An often-used non-parametric method for predicting a numerical target variable is the regression

tree [19]. In regression trees, estimation is carried out based on some decision rules, which de-

fine the segmentation of the predictor space in a predefined number of regions. Regression trees

are very simple to implement and lead to easily interpretable predictions, reasons for which they

have gained large popularity in prediction and classification models. An important parameter of

regression trees is the tree depth D, which defines the maximum depth of the tree and therefore its

complexity. This quantity corresponds to the number of edges along the longest path from the root node, where the root node has a depth of 0. An example of a regression tree and the corresponding

depth levels are shown in Figure 2.
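As a sketch, such a tree can be grown in R with the rpart package (an assumed choice), capping its depth at D = 3 as in Figure 2:

library(rpart)

# Regression tree with maximum depth 3; cp = 0 disables cost-complexity
# pruning so that the depth constraint drives the tree's size.
tree_fit <- rpart(y ~ ., data = d,
                  control = rpart.control(maxdepth = 3, cp = 0))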

2.4 Boosting

The term 'boosting' refers to a family of step-wise machine learning algorithms that, given a target variable and a set of predictors, sequentially optimise model performance by adding predictors one at a time. The idea is to combine a set of weak learners (predictors that perform marginally better than random) into a strong learner by ensembling their predictive abilities. The power of boosting lies in the fact that weak learners are simple and fast to implement, making it possible to obtain a highly accurate model with little computational effort.

A large number of boosting models have been developed in the literature; they differ from each other mostly in the way the weak learners are included in the iterative process and in how the

models are combined to get the final one. Almost any statistical technique with tuning parameters

can be transformed into a weak learner [11], leading to a large variety of boosting methods. The

common feature among them is that they are all based on a defined set of weak learners, with the

goal to optimise the predictive ability of regression and classification models.

Through boosting, it is possible to include information from more than one predictor while

keeping a relatively simple model structure. The high accuracy that boosting achieves [17] and the possibility of including multiple predictors make it an appealing method for improving the accuracy of ratio imputation.

Gradient boosting

The gradient boosting model was introduced by Friedman in 2001 [5]. The basic principle of

gradient boosting is to find the best additive model by optimization of the loss function, which

corresponds to the model's squared errors. At each iteration, a weak learner is fitted to the residuals of the current model. The weak learner is chosen in a way that leads to the minimisation of the loss function. The fitted learner is then added, scaled by the learning rate, to the previous model, and the procedure

continues iteratively for the specified number of iterations.

More specifically, given a set of $p$ predictors $(x_1, \ldots, x_p)$ and a target variable $y$, boosting

achieves the optimal prediction of y given x by minimization of some specified loss function

$L(y, F(x))$.

This translates into the estimation of the function:

$$\hat{F} = \underset{F}{\operatorname{argmin}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, F(x_i)) \approx \underset{F}{\operatorname{argmin}} \; E\big[L(y, F(x))\big], \qquad (4)$$

where $L(y, F(x))$ is assumed to be differentiable with respect to $F(x)$, and $F(x)$ reflects the choice of the base learner (e.g., a regression tree). Examples of loss functions for the wls, Huber and Tukey regression estimators can be found in Table 1.

As specified in Equation 5, at each iteration $m$ the model will have the form:

$$F_m(x) = F_{m-1}(x) + \lambda\, h_{m-1}(x), \qquad (5)$$

where:

• $m$ is the number of the current iteration ($1 \le m \le M$);

• $F_{m-1}(x)$ is the model at the previous iteration;

• $\lambda$ is the learning rate ($0 < \lambda \le 1$), which defines how quickly the model is updated (smaller values lead to a slower model update);

• $h_{m-1}(x)$ is a function of the model residuals at iteration $m-1$, i.e. $h_{m-1}(x) = y - F_{m-1}(x)$.

The decision on the number of iterations $M$ depends on the kind of data and problem at hand. Most programs that implement boosting offer an option to determine the optimal stopping iteration $m_{stop}$ (the one that leads to the best model in terms of accuracy), where $1 \le m_{stop} \le M$. To derive this value, cross-validation or the AIC (Akaike Information Criterion [1]) is used internally.
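As an illustration, the mboost R package exposes this mechanism directly (one example of such a program; the thesis itself works through caret, see Appendix A):

library(mboost)

# Componentwise boosting; cvrisk() estimates the optimal m_stop by
# resampling, and mstop() extracts it.
bm  <- glmboost(y ~ ., data = d, control = boost_control(mstop = 100))
cvm <- cvrisk(bm)
mstop(cvm)  # optimal stopping iteration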

The choice of the base learners defines the model itself, as will be further explained in the

next sections. Common choices are, for example, regression trees, linear models or additive linear

models. As explained by Bühlmann and Hothorn [3], the reasons behind the choice of base learn-

ers can be based on the structural properties of the boosting estimates and/or on maximizing the

predictive model ability.

Considering the focus of my thesis, I will first illustrate a gradient boosting algorithm with

robust linear regression as base learner.



Boosted robust regression

Applied to the ratio model, on a dataset with n units the optimal prediction of y given p predictors

is achieved through the following steps:

1. set the desired learning rate $\lambda$ and number of iterations $M$;

2. initialize the model with a constant value, for example the average observed response $\bar{y}$, such that $F_0(x) = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$;

3. for $m = 0$ to $M$:

(a) compute the residuals $r_m = y - F_m(x)$;

(b) for each candidate predictor, fit a robust regression model $h(x)$ to the residuals $r_m$ and pick the predictor $x_k$ (with $k = 1, \ldots, p$) that minimises the loss function. The minimisation of the loss function is obtained by computing its negative first derivative with respect to $F(x_k)$, i.e. $-\left[\frac{\partial L(y_i, F(x_{i,k}))}{\partial F(x_{i,k})}\right]$, for $i = 1, \ldots, n$;

(c) update the estimate of the target variable as:

$$F_{m+1}(x) = F_m(x) + \lambda\, h_m(x_k) = F_m(x) + \lambda\, \hat{R} x_k, \quad k \in \{1, \ldots, p\}, \qquad (6)$$

where $\hat{R}$ is the coefficient estimated from the robust model fitted in the previous step. This model will have residuals $r_{m+1} = y - F_{m+1}(x)$, which will be used in the next iteration from step 3a on;

4. output the boosted model:

$$F_M(x) = F_{M-1}(x) + \lambda\, h_{M-1}(x). \qquad (7)$$

As in normal robust regression, predictors that enter the model will have a specific coefficient:

the difference lies in the fact that in a boosted model, a predictor is included only if at some

iteration m it was the one leading to the lowest residual (in step 3b). It follows that the same

predictor can enter the model multiple times, and that the earlier a variable enters the model, the

higher its influence on the final estimate of the target variable (at each iteration the algorithm is

improving its performance, therefore residuals will be smaller and smaller). At the final iteration

M, each predictor $x_k$ (with $k = 1, \ldots, p$) will have a specific coefficient, which is defined as:

$$\hat{R}_k = \sum_{m=1}^{M} \lambda \cdot \hat{R}_k^{[m]} \cdot \mathbb{1}(k_m = k), \qquad (8)$$

where:

• $\hat{R}_k^{[m]}$ is the robust regression coefficient estimated at iteration $m$;

• $\mathbb{1}(k_m = k)$ is an indicator function that takes value 1 if predictor $x_k$ was included at iteration $m$, and value 0 if it was not.

It follows that the coefficient of a variable is updated only at the iterations in which this variable was included, i.e. at iterations where this variable led to the minimisation of the loss function.
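To make the procedure concrete, the following R sketch mirrors steps 1-4 above. It is illustrative only, not the thesis implementation: MASS::rlm with the Tukey bisquare serves as the robust base learner, and the squared error of each candidate fit is used as the selection criterion.

library(MASS)

# Componentwise boosting with a robust ratio base learner.
# X: n x p predictor matrix; y: target vector; lambda: learning rate.
boost_robust <- function(X, y, M = 100, lambda = 0.1) {
  Fm <- rep(mean(y), length(y))          # step 2: F_0(x) = ybar
  coefs <- numeric(ncol(X))              # accumulated R-hat_k (Equation 8)
  for (m in seq_len(M)) {
    r <- y - Fm                          # step 3a: residuals
    # step 3b: one no-intercept robust regression per candidate predictor
    fits <- lapply(seq_len(ncol(X)),
                   function(k) rlm(r ~ X[, k] - 1, psi = psi.bisquare))
    k <- which.min(vapply(fits, function(f) sum(residuals(f)^2), 0))
    R_hat <- coef(fits[[k]])[[1]]
    Fm <- Fm + lambda * R_hat * X[, k]      # step 3c: Equation 6
    coefs[k] <- coefs[k] + lambda * R_hat   # coefficient update (Equation 8)
  }
  list(fitted = Fm, coefficients = coefs)   # step 4
}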

Boosted regression trees

If regression trees are used as the base learner in place of robust linear regression, in step 3b a regression tree is fitted to the residuals. This gradient boosting algorithm consequently has three tuning parameters: the number of iterations $M$ and the learning rate $\lambda$, as defined in the previous section, and the tree depth $D$, as defined in subsection 2.3.

Gradient boosting where F (x) represents a regression tree leads to the following algorithm:

1. Select the maximum tree depth $D$, the learning rate $\lambda$ and the number of iterations $M$;

2. Compute the average observed response, ȳ, and use it as the initial predicted value;

3. for m=0 to M:

(a) Compute the residual rm (the difference between the observed and the current predicted

values);

(b) Using the residuals rm as the response, fit a regression tree hm (x) of depth D for each

predictor separately, and pick the one that minimises the loss function;

(c) Update the predicted value by adding $\lambda\, h_m(x)$ to the predicted values generated in the previous step;

4. Output the boosted model.

Note that the steps of the above algorithm correspond to those of the boosted robust linear regression, as both are instances of gradient boosting with different base learners. Hence, the choice of base learner influences the way in which predictions are computed, but not the algorithm itself.
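A boosted regression tree of this kind can be sketched in R with, e.g., the gbm package (an assumed choice; the thesis fits its boosted models through caret, see Appendix A):

library(gbm)

# Gradient-boosted trees with the three tuning parameters named above:
# n.trees = M, shrinkage = lambda, interaction.depth ~ tree depth D.
gbm_fit <- gbm(y ~ ., data = d, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, interaction.depth = 3)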

3 Methods

In this section, data and estimation methods are described. The focus is on understanding both

which model leads to the best imputation accuracy and whether the inclusion of multiple variables

results in a definite accuracy improvement compared to the one-predictor models.



3.1 Data description

Analyses are based on two datasets. The first dataset is from the Dutch Structural Business survey, where 274 units are recorded on six variables: Turnover, Number of employees, Cost of purchases, Depreciation cost, Cost of employees and Other costs. Turnover and Number of employees are usually considered predictor variables, because they express complementary measures of the size of the business and are known from other sources (e.g., tax registers, chamber of commerce); they are therefore rarely affected by missing values.

The second dataset is from the Dutch Healthcare sector, where information on 841 healthcare institutions is recorded. The kinds of variables included are similar to the previous dataset, and they consist of

Turnover, Number of employees, Depreciation cost, Total costs, Wage salaries, Pension costs and

Other costs.

As is often the case with business data, all the variables considered above are right-skewed and

strictly positive. A number of outliers was present in both datasets, and analyses were based both

on the complete dataset and on a subset of the data where the most obvious outliers were excluded

(five in the Dutch Structural Business dataset and one in the Dutch Healthcare dataset). Results from the two versions were compared to check their consistency.

3.2 Estimation methods in the non-boosted models

A ratio model was fitted to the data for every combination of target-predictor variables. In each

model, the ratio R was estimated based on the wls, Huber and Tukey estimators.

To assess the estimators' accuracy on out-of-sample predictions, the data were divided into 80% training data and 20% test data, where the test data served as induced missing values that were used to test the accuracy of the fitted models. Namely, the true (known) values were hidden during model fitting and used only in the determination of the model's accuracy. The classical ratio model was then fitted to the training data only, and used to predict the 'missing' test data. Predictions

were computed following Equation 1, and the final model error for each estimator and variables’

combination was obtained by computing the accuracy measures presented in subsection 3.6.
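A minimal sketch of this split in R, using the caret package introduced in subsection 3.3 (the seed value is assumed; the thesis does not report one):

library(caret)

set.seed(123)                                           # assumed seed
idx <- createDataPartition(d$y, p = 0.8, list = FALSE)  # 80% training rows
train_data <- d[idx, ]
test_data  <- d[-idx, ]   # targets here act as 'induced' missing values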

3.3 Estimation methods in the boosted models

To assess how boosting affects imputation accuracy, four different boosted models have been fitted to the two datasets:

• GAMs (Generalized Additive Models);

• Regression Trees;

• RLMs (Robust Linear Models) with Huber estimator;

• RLMs (Robust Linear Models) with Tukey estimator.

For both datasets, each of the target variables (i.e. all variables excluding Turnover and Num-

ber of employees) was used as dependent variable. To understand the influence of the number of

predictor variables on the model performance, three different approaches have been implemented

and compared. Namely, the above four models have been fitted, for each dependent variable,

to: (i) one predictor only (Number of employees); (ii) both the predictors (Number of employees

and Turnover); (iii) all the available variables (four for the first dataset and five for the second one).

To enhance comparability of results and reduce the influence of randomness, the same train-test split has been used in the estimation of all the different models, through the caret R package [12]. More information on this package can be found in Appendix A, together with a simple hands-on model tuning example. Considering the three approaches and the number of variables involved, the number of fitted models amounted to 48 for the first dataset (4 target variables × 4 models × 3 methods) and 60 for the second one (5 target variables × 4 models × 3 methods). This corresponds to fitting 12 models to each target variable, for a total of 108 models. Boosting was run for 100 iterations for each of the models, and results were compared based on the accuracy measures presented in subsection 3.6.

For each model, variable importance was assessed: a measure based on the order and the number of times each predictor entered the boosting algorithm. Considering the nature of boosting (as explained in subsection 2.4), the first variable that enters the model is the one that best minimises the loss function, i.e. the one that is best able to predict the corresponding output variable. Therefore, variable importance is a function of the reduction in squared error,

also called risk reduction. In particular, the decrease in squared error for each predictor is summed

up at each boosting iteration (i.e., at each iteration, each predictor gets a variable importance

value). For each variable, these values are then averaged across the M iterations to generate an

overall variable importance score [6].

The self-frequency values were also inspected: the proportions representing the relative number of times a variable is included in the boosting model (at each of the M iterations, one variable is included). In a given model, the self-frequencies sum to one, and the variable with the highest self-frequency is not necessarily the one with the highest variable importance value:

for example, a predictor which is included only once at an earlier model iteration might bring

a higher improvement compared to another predictor which is included multiple times at later

stages. This is easily explained by the nature of boosting, where model’s residuals are sequentially

reduced: as the number of iterations increases, residuals will be smaller and smaller. Therefore,

if a variable is included multiple times at later stages, its influence on residuals’ reduction will be

lower, as residuals have already been reduced at all the previous iterations.
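Given the per-iteration selections, the self-frequencies reduce to a simple tabulation; in R (illustrative; selected is an assumed vector holding the index of the predictor chosen at each of the M iterations, and p the number of predictors):

# Proportion of iterations in which each predictor was selected
self_freq <- table(factor(selected, levels = 1:p)) / length(selected)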

3.4 Comparison of boosted and non-boosted models

Given the focus on robust estimation and boosting, with the goal of comparing models to their boosted versions, an extension of the classical ratio imputation model was introduced. Following the methods of the previous section, a non-boosted ratio imputation model was fit with (i) one predictor only (Number of employees), which corresponds to the model fitted in subsection 3.2; (ii) both predictors (Number of employees and Turnover); (iii) all the available variables (four for the first dataset and five for the second one). For each target variable, six more models were fit (three with Huber estimation and three with Tukey estimation). Analyses were again based on the same train-test

data division, and as in the classical ratio model, the intercepts were excluded.

The boosted robust models were then compared to their non-boosted versions to assess whether

the higher model complexity induced by boosting has an impact on the models’ predictive ability.

Also, an overall accuracy comparison among boosted and non-boosted models was performed.

Given the large number of models, a visual comparison of the 18 models (12 boosted models and

6 non-boosted) was created, where the change in accuracy across the three methods (one, two or

all variables) can be visually inspected.

To check whether (1) models with two predictors, compared to models with one predictor, and (2) models with all variables as predictors, compared to models with two predictors, led to a consistent accuracy improvement, the percentage decrease in RMSE (Root Mean Squared Error) and MAPE (Mean Absolute Percentage Error) was investigated for the model that performed best overall. A 10% difference in accuracy was chosen as the threshold for considering an improvement effective.

3.5 Means’ accuracy after imputation

The focus of Statistics Netherlands is often on the estimation of means and totals of the variables

of interest, where missing values are present. In this regard, the methods proposed in this paper

have been evaluated and compared to the standard ratio imputation procedure used by National

Statistics Institutes, with the goal of finding a method that leads to a post-imputation mean as

close as possible to the real population one. The standard among National Statistics Institutes is

to impute values based on a classical ratio model with either Turnover or Number of employees as

predictor variable.

The true (observed) mean of each target variable has been compared with the post-imputation

mean obtained with (1) classical ratio imputation; (2) the newly-proposed method (non-boosted

Tukey imputation with all variables included).

The true (observed) mean has been computed on the complete dataset (i.e., before the exclusion

of the test data), while the means for the classical imputation and the Tukey robust methods were

computed on the complete data where the actual test data values were replaced by the predictions

made by the corresponding models. Accuracy at the aggregate level has been based on the MAPE.

3.6 Accuracy measures

The predicted values were compared to the (known) true y-values to assess models’ performance.

At each iteration, accuracy was assessed based on the following measures:

- The Mean Absolute Percentage Error (MAPE) provides a standardized accuracy measure in terms of percentages; lower values correspond to better predictions. It is defined as:

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{|y_i|} \times 100, \qquad (9)$$

where:

• yi is the actual observation;

• ŷi is the unit prediction (obtained using Equation 1);

• N is the sample size.

- The Root Mean Squared Error (RMSE), which corresponds to the standard deviation of

the residuals, is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}. \qquad (10)$$

It measures the average magnitude of the error, and the squared term causes high errors to have a

stronger influence on the final error estimate.
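Both measures translate directly from Equations 9 and 10; in R:

# Accuracy measures used throughout (Equations 9 and 10)
mape <- function(y, yhat) mean(abs(y - yhat) / abs(y)) * 100
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))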



4 Results

Results will be compared, and estimation accuracy will be assessed, as presented in section 3: (1)

between the classical ratio imputation method and the use of robust statistics with one predictor

variable; (2) on the four boosted models, where each model has been fitted with one, two, or all

variables as predictors; (3) on the boosted and non-boosted robust models, where the non-boosted

ones have analogously been compared on one, two, or all variables.

As introduced in subsection 3.1, the standard among Statistical Offices is to use either Turnover

or Number of employees as predictor variable in the classical ratio model. For all models that

involved only one predictor variable, results are here presented for the predictor variable Number

of employees. Results obtained with the predictor variable Turnover are comparable and have

therefore been omitted. Also, detailed results have been reported for the target variable Other

costs from the Dutch Structural Business dataset; detailed numerical results for the other variables

can be found in Appendix B.

4.1 Comparison between classical estimation and robust statistics

At first, the wls, Huber and Tukey estimators were used to derive the ratio (R) estimates on the

complete data, based on the loss functions presented in Table 1. Results have been summarized

in Table 2 for the predictor variable Number of employees and the target variables in the Dutch

Structural Business Dataset.

Subsequently, the same models were fit to the training data, while the test data were used as 'induced' missing values to assess model accuracy. The MAPE and RMSE were used to

assess accuracy among the different estimators, based on the comparison between the predicted

and actual target values. Table 3 includes accuracy results (where lower values indicate a lower

prediction error) for the target variables in the Dutch Structural Business dataset and the predictor

variable Number of employees. Models’ accuracy is very dependent on the predictive ability of

the predictor variable with respect to the target one: for example, it is clear that Number of em-

ployees is not a good predictor for the target variable Cost of purchases, while it is a good one for

Depreciation cost, leading to imputed values that are ≈ 1.4% off the real ones.

Considering both datasets, robust estimation with one predictor variable led to a decrease in MAPE, compared to the wls method, for seven target variables out of nine (≈ 77%), with Tukey estimation performing slightly better than Huber estimation most of the time. When considering the Tukey model and the predictor variable Number of employees, the average decrease in MAPE (over all nine target variables) compared to the classical wls method is 0.67%.



4.2 Boosting

As explained in subsection 3.3, for each of the nine target variables, accuracy results were compared across three methods: the simplest, where only one predictor variable is included; the second, where both predictor variables enter the model; and the third, which comprises both predictor and target variables. Boosted GAMs, Regression Trees, Huber and Tukey robust models

were compared. The accuracy measures for each target variable are reported in Appendix B. Considering all target variables, only in two cases out of nine (≈ 22%) did a GAM with two predictors perform better than the other models. In all other cases, the inclusion of all variables as predictors brought improved imputation accuracy. In particular, the boosted GAM including all variables outperformed all other models three times out of nine (≈ 33%), while for the remaining four target variables the best imputation model turned out to be the boosted Huber (≈ 44%).

Variable importance was assessed to understand which predictors brought the highest con-

tribution to the boosted models. In particular, the focus was on understanding whether the two

predictors that are usually used in ratio imputation models (Turnover and Number of employees)

are actually the ones that had the major impact in the boosting algorithms. For the boosted Huber models, variable importance values were inspected for each of the nine target variables: Turnover appeared among the two variables with the highest contribution five times out of nine, while Number of employees only once out of nine (for the target variable Cost of employees, as the two measure closely related quantities). It is noted that the two predictor variables never appeared together as the two best predictive inputs. This suggests, as confirmed by the

other results, that the inclusion of multiple predictors can be of great importance for an increased

model accuracy. An example of the variable importance values for the target variable Other costs

is reported in Figure 3, where both the variables’ inclusion in terms of frequency and the variables’

importance in terms of risk reduction are reported.

In this case, the variables Cost of employees and Depreciation cost proved to be considerably better predictors than the standard predictor variables Number of employees and Turnover. Cost of employees was included in the boosted model 47% of the time (i.e., in 47% of the M iterations) and contributed 73% of the risk reduction, while Depreciation cost was included in 19% of the iterations and contributed 29% of the risk reduction. On the other hand, Turnover was included in the boosted model 11% of the time, contributing only 0.03% of the risk reduction; and even though Number of employees was included 17% of the time, its contribution in terms of risk reduction was null (≈ 0%). This example makes clear that the self-frequency of a variable is not directly related to its contribution in terms of risk reduction, as explained in subsection 3.3.



4.3 Comparison between boosted and non-boosted models

To determine whether the increased complexity of the boosted models conveyed a satisfactory increase in accuracy, they have been compared to the non-boosted robust Huber and Tukey models. In seven cases out of nine (≈ 77%), robust non-boosted estimation proved superior to all the boosted methods in terms of both the lowest RMSE and the lowest MAPE. In particular, the Tukey estimator with all variables as predictors resulted in the lowest imputation errors six times out of nine (≈ 66%). Consistently, for all nine target variables (100%), the use of robust estimation (boosted or non-boosted) resulted in the lowest errors. This implies that in the five cases where boosted GAMs outperformed the other boosted models (see subsection 4.2), non-boosted robust models proved superior to the GAMs in all cases. Figure 4 provides a graphical representation of how the different models affect imputation accuracy for the target variable Other costs, where it is clear that a sharp improvement in MAPE has been obtained with the inclusion of all variables in non-boosted Tukey estimation. Even though for some models the inclusion of multiple predictors resulted in decreased accuracy compared to the same model with one or two predictors (e.g., the GAM in this example), the model that performed best overall has been shown to always include all predictors, for all of the target variables. Detailed numerical results for the other eight target variables can be found in Appendix B.

On the whole, non-boosted Tukey estimation with all variables as predictors proved to be the model with the highest accuracy, and it is therefore the suggested imputation model. Over the nine target variables, the MAPE for this model ranged between 0.14 and 4.47, with a mean of 1.74. This implies that, on average, this model will lead to individual imputed values that are 1.74% off the real ones.

As introduced in subsection 3.4, the percentage decrease in RMSE and MAPE for the overall best model (non-boosted Tukey) has been investigated between models with (1) one versus two predictors and (2) two versus all predictors. The inclusion of two predictors, compared to the one-predictor models, resulted in an average percentage decrease of 19.39% in RMSE and of 27.46% in MAPE. Furthermore, the inclusion of all variables, compared to models with only two predictors, led on average to a 42.05% decrease in RMSE and a 43.69% decrease in MAPE. Hence, both comparisons largely exceeded the 10% threshold above which improvements were considered effective.

In Table 4, an overview of the accuracy comparison, based on the MAPE, between the classical wls model and the new method is presented. Only for one variable out of nine (≈ 11%) did the new method not result in improved prediction accuracy. For all other variables, the improvement ranges between 12.20% and 96.97%. The overall mean accuracy improvement, based on the MAPE, is 48.98%.

4.4 Means’ accuracy after imputation

The comparison of the post-imputation means between the classical and the newly proposed method (non-boosted Tukey with all predictors) is reported in Table 5 for each target variable, where Number of employees was used as predictor variable.

Only for four target variables out of nine (≈ 44%) did the new method result in a more precise post-imputation mean. The post-imputation means obtained with the new method presented a percentage difference (MAPE) from the true (observed) means ranging between 0.03% and 4.16%, with an average of 1.13%; that is, on average, the post-imputation means obtained with this method are 1.13% off the real means.

5 Conclusions and future research

This paper investigated novel methods to achieve more precise ratio imputation of missing business data. The standard in National Statistical Offices is to impute missing values based on a ratio

model with one predictor variable (Turnover or Number of employees), where wls estimation is

usually performed.

At first, classical and robust techniques have been compared and discussed. Table 2 shows how

the use of robust statistics impacts the ratio estimate when only one predictor variable is involved.

Imputation performance suggests that the robust ratio models perform better than the classical

ratio estimator in 77% of the cases, although the improvement in MAPE brought by the robust methods was, on average, only 0.67%.

Next, the machine learning technique boosting has been applied to the case of interest, allowing for

a step-wise inclusion of multiple predictors. Boosted GAMs, Regression Trees, Tukey and Huber

models have been compared for each of the nine target variables separately, where each model has

been fitted with one, two and all variables as predictors, with the goal of assessing how the inclusion

of multiple variables affects imputation accuracy. Boosted models have also been compared to the

non-boosted robust models (Huber and Tukey) with one, two and all predictors. The most precise model turned out to be the non-boosted Tukey with all variables for six target variables out of nine (≈ 66%), the non-boosted Huber for one target variable (≈ 11%; Huber estimates were very close to the Tukey ones in all cases) and the boosted Huber for the remaining two variables (≈ 22%). It follows that robust estimation (boosted or non-boosted) with the inclusion of all predictors outperformed the other methods for all the considered target variables.

On the whole, non-boosted Tukey estimation led to the best model and is therefore suggested as a future method for imputing missing business data. This model leads to individual imputed values that are, on average, 1.74% off the real ones. When all predictors were included,

this model led to an average percentage decrease in RMSE of 42.05% and of 43.69% in terms of

MAPE, compared to the same models with the two standard predictors only. The present findings confirm that, to achieve higher imputation performance, the inclusion of information from other available variables is of definite importance. Even though the initial choice of predictors was based on practical reasons (see subsection 3.1) and on the standard choice among Statistical Offices, it is reasonable that including as much information as is available enables the model to capture multiple aspects of the data, leading to an improved imputation model. Also, the inspection of variable importance in the boosted models showed that the two standard predictors never appeared together as the best predictors.

The newly proposed method, when compared to the classical ratio model where either Number of employees or Turnover is used as predictor variable, brought on average a 48.98% improvement in accuracy in terms of MAPE. Hence, individual prediction of missing values can greatly benefit from this new method.

Considering that the focus of National Statistical Institutes is often on means and totals of the target variables of interest, the newly proposed method has finally also been compared to the standard ratio model at the aggregate level. When Number of employees was used as predictor variable, only in ≈ 44% of the cases did the new method lead to a more precise post-imputation mean. It is worth noting the differences between results at the individual level and those at the aggregate level: the very large improvement in accuracy obtained with Tukey imputation is not reflected in the accuracy of the same method at the aggregate level. This might be due to the fact that the inclusion of multiple predictors helps the model capture multiple and very precise aspects of the data at the individual level (where each business is considered as a unit), while at the aggregate level a single variable reflecting the size of the businesses provides a simpler, yet more effective, means to obtain aggregate estimates.

Future research could assess whether the present findings can be extended to different kinds

of data. Also, as described in subsection 2.2, in robust estimation the proposed tuning constants k

have been shown to be optimal for symmetrical distributions [10]. Considering the skewed nature

of the examined data, future research should consider a different procedure for the determination

of the tuning constants. Lastly, it is noted that all the boosted models have been fitted for a fixed number of iterations (M = 100); as explained in subsection 2.4, the maximum boosting performance is obtained when the model is run for the optimal number of iterations $m_{stop}$, usually obtained through cross-validation. Although this process is implemented in the caret R package, it is limited to the Gaussian family. It would be of great value for future studies to develop a procedure that extends the automatic choice of the best $m_{stop}$ value to non-Gaussian families and, in particular,

considering the promising results obtained with robust statistics, to the Huber and Tukey families.

In addition, different levels of aggregation should be investigated in future research, to understand

whether the great improvements obtained at the individual level can be extended to other levels of

aggregation (e.g., splitting up small and big businesses).

In conclusion, this paper provides evidence of a method that effectively improves imputation

accuracy of business missing data at the individual level. When detailed information is the focus

(e.g., modeling, individual predictions, etc.), the new method brings on average a 48.98% accu-

racy improvement (based on the MAPE) compared to the standard method. On the other hand, at

higher levels of aggregation (i.e., estimation of the means or totals), the classical wls imputation

method is suggested, for its simplicity and its greater accuracy at the aggregate level.

Appendices

Appendix A

The caret R package

The caret (classification and regression training) package [12] was created by Max Kuhn in 2008,

with the goal of having a comprehensive tool for complex regression and classification models

in R [16]. Kuhn developed the package on top of the available R packages for prediction, so that a single interface could serve a large number of models with the same syntax. The idea arose from the need to resolve inconsistencies among the various packages that implement predictive models: different tuning parameter settings and slightly different prediction methods lead to confusion and difficult model comparison. The caret package solved these issues by providing a unified

interface and syntax, as well as increased computational efficiency through the use of parallel pro-

cessing (the computational process is split into several simultaneous processes that run in parallel).

On a variety of predefined models (112, including the most used lm, glm, etc.), caret supports:

• data splitting and pre-processing (centering, scaling, imputation with different methods,

etc.);

• model fitting and performance assessment:

– the function trainControl allows the user to define the resampling method (e.g., bootstrapping, cross-validation, repeated cross-validation, etc.) to be used during the model fitting process;

– the function train enables the user to fit each of the 112 models based on a unified

syntax;

– the train function automatically performs hyperparameter optimisation based on the resampling method specified in trainControl. Additionally, if the user wants to fit the model on a specific set of tuning parameters, these can be provided via the expand.grid command;

– the user can specify the preferred accuracy measure(s) to assess model performance,

which will be evaluated at each different combination of the tuning parameters. The

default metric for regression is the RMSE;

• predictions can be obtained through the predict function, which (unless specified differently) automatically uses the combination of model hyperparameters that led to the best accuracy measure.

It follows that caret's model fitting procedure makes it easy to check whether, e.g., a boosted tree model performs better with a maximum tree depth of 2, 4 or 10. Also, straightforward plot-

ting options are available to visually inspect how models’ accuracy is influenced by the different

combinations of tuning parameters. The influence on accuracy can be compared both within and between models: on the one hand, as just mentioned, different combinations of tuning parameters can be easily compared for the same model. On the other hand, different models run on the same

dataset can likewise be compared. Another very attractive feature of this package is that it allows comparing multiple models based on the same resampling profiles. For instance, when k-fold cross

validation is implemented, the data sub-samples can be set to be exactly the same among different

models. In this way, the randomness induced by the resampling procedure plays a smaller role in

the determination of which model performs best, leading to a clear and reliable model comparison.

The comprehensiveness of the caret package, together with its other attractive features and the consistent way in which models are assessed, makes it a useful guide for the researcher in model choice.

A minimal example of model fitting is provided in Listing A.1:

LISTING A.1: Example of fitting a boosted regression tree model with caret

tr <- trainControl(method = "cv", number = 10)

grid <- expand.grid(shrinkage = seq(0.1, 1, by = 0.4),
                    interaction.depth = c(2, 4, 10),
                    n.trees = c(100, 300, 500, 1000),
                    n.minobsinnode = 10)  # column required by caret's gbm
                                          # grid; value assumed (gbm default)

gbm_model <- train(y ~ ., data = train_data, method = "gbm",
                   trControl = tr, tuneGrid = grid)

gbm_predictions <- predict(gbm_model, test_data)



Appendix B

Accuracy comparison among models

Accuracy measures in terms of RMSE and MAPE are here presented for each model (one predic-

tor, two predictors and all variables) as described in subsection 4.2.

Results for the target variables in the Dutch Healthcare dataset are presented in Tables B.1 to B.5, while results for the target variables in the Dutch Structural Business dataset are presented in Tables B.6 to B.9.

TABLE B.1: Imputation error assessment for the target variable Wage salaries

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 616.15 546.27 297.99 2.60 1.79 1.21
Regression Tree* 671.93 704.56 461.17 2.87 2.22 1.57
Huber* 621.65 533.48 381.53 2.45 1.77 1.45
Tukey* 635.82 525.85 451.09 2.61 1.87 1.76
Huber 631.92 546.60 139.68 2.45 1.81 0.63
Tukey 621.61 500.90 125.20 2.44 1.77 0.58
* Models that have been boosted.

TABLE B.2: Imputation error assessment for the target variable Other costs

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 476.68 370.55 321.40 1.90 1.37 1.18
Regression Tree* 471.13 326.51 276.62 1.99 1.28 1.20
Huber* 531.21 386.13 361.06 2.00 1.37 1.22
Tukey* 552.77 417.37 414.15 2.04 1.42 1.39
Huber 527.76 386.26 162.75 1.98 1.39 0.71
Tukey 534.98 395.54 135.31 1.99 1.39 0.66
* Models that have been boosted.

TABLE B.3: Imputation error assessment for the target variable Pension costs

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 58.05 64.01 51.96 0.24 0.21 0.15
Regression Tree* 60.71 71.50 68.51 0.29 0.27 0.21
Huber* 54.62 63.19 51.38 0.22 0.21 0.14
Tukey* 54.06 58.99 50.40 0.23 0.20 0.16
Huber 55.69 61.20 49.04 0.22 0.21 0.13
Tukey 54.47 51.97 48.63 0.22 0.18 0.14
* Models that have been boosted.

TABLE B.4: Imputation error assessment for the target variable Total costs

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 1073.19 917.71 647.37 4.91 2.10 1.55
Regression Tree* 1159.11 1030.65 684.04 5.41 2.50 1.88
Huber* 1133.68 856.86 374.16 4.72 2.11 1.25
Tukey* 1188.17 856.63 669.72 5.06 2.76 2.77
Huber 1131.21 894.77 173.21 4.69 2.15 0.72
Tukey 1129.77 855.63 139.68 4.69 2.13 0.64
* Models that have been boosted.

TABLE B.5: Imputation error assessment for the target variable Cost depreciations

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 93.85 91.47 91.06 0.42 0.38 0.38
Regression Tree* 92.92 90.61 90.76 0.42 0.39 0.39
Huber* 92.72 83.96 82.90 0.41 0.35 0.35
Tukey* 93.56 85.13 85.60 0.41 0.35 0.36
Huber 93.39 83.92 90.41 0.41 0.36 0.37
Tukey 95.04 85.66 87.47 0.41 0.35 0.36
* Models that have been boosted.

TABLE B.6: Imputation error assessment for the target variable Other costs

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 1566.72 1556.33 1906.90 9.40 8.58 9.22
Regression Tree* 1753.75 1852.52 1881.12 10.29 10.65 9.53
Huber* 1947.34 2014.20 1988.69 9.23 9.22 8.77
Tukey* 2015.31 2023.71 2053.70 9.19 9.07 8.83
Huber 1855.51 1987.76 973.39 8.08 8.59 4.56
Tukey 1838.57 1912.61 705.31 8.02 8.23 4.06
* Models that have been boosted.

TABLE B.7: Imputation error assessment for the target variable Cost purchases

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 35739.16 1881.65 1493.52 161.15 12.73 9.89
Regression Tree* 35933.39 12399.51 12399.51 166.54 37.94 37.94
Huber* 36058.81 5360.92 5353.10 152.72 24.10 23.22
Tukey* 36262.97 8256.81 8256.81 151.52 32.07 32.07
Huber 36800.83 3137.41 679.59 144.86 14.23 4.47
Tukey 37049.29 3578.34 690.35 146.17 15.42 4.47
* Models that have been boosted.

TABLE B.8: Imputation error assessment for the target variable Cost employees

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 706.89 666.94 721.20 4.57 4.35 4.54
Regression Tree* 783.23 818.89 792.63 5.29 5.54 5.34
Huber* 714.69 777.34 627.07 5.44 5.03 4.59
Tukey* 733.68 792.64 672.45 5.38 5.20 4.92
Huber 647.90 720.59 474.88 4.57 4.37 3.42
Tukey 696.31 815.45 486.43 4.36 4.42 3.39
* Models that have been boosted.

TABLE B.9: Imputation error assessment for the target variable Cost depreciations

                                 RMSE                                  MAPE
Model                            One pred.   Two preds.   All vars.    One pred.   Two preds.   All vars.
Generalized Additive Model* 234.64 250.34 243.31 1.51 1.56 1.49
Regression Tree* 225.63 235.03 238.00 1.54 1.59 1.58
Huber* 225.35 225.35 216.07 1.49 1.49 1.39
Tukey* 229.70 229.70 221.00 1.49 1.49 1.40
Huber 225.35 225.35 216.07 1.49 1.49 1.39
Tukey 229.70 229.70 221.00 1.49 1.49 1.40
* Models that have been boosted.

Bibliography

[1] Akaike, H. (1998). A New Look at the Statistical Model Identification, pp. 215–222. New

York, NY: Springer New York.

[2] Allison, P. D. (2002). Quantitative Applications in the Social Sciences: Missing data. Series:

Quantitative applications in the Social Sciences. Thousand Oaks, CA, USA: Sage Publications,

Inc.

[3] Bühlmann, P. and T. Hothorn (2007, 11). Boosting algorithms: Regularization, prediction and

model fitting. Statistical Science 22(4), 477–505.

[4] de Waal, T., J. Pannekoek, and S. Scholtus (2011). Handbook of Statistical Data Editing and

Imputation. Wiley.

[5] Friedman, J. H. (2001, 10). Greedy function approximation: A gradient boosting machine.

Ann. Statist. 29(5), 1189–1232.

[6] Friedman, J. H. (2002, February). Stochastic gradient boosting. Computational Statistics &

Data Analysis 38(4), 367–378.

[7] Huber, P. J. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical

Statistics 35(1), 73–101.

[8] Kalton, G. and D. Kasprzyk (1986). The treatment of missing survey data. Survey Methodol-

ogy 12(1), 1–16.

[9] Kang, H. (2013). The prevention and handling of the missing data. Korean journal of anes-

thesiology 64(5), 402–6.

[10] Kelly, G. E. (1992). Robust Regression Estimators-The Choice of Tuning Constants. The

Statistician 41(3), 303.

[11] Kuhn, M. and K. Johnson (2013). Applied Predictive Modeling. New York, NY: Springer.

[12] Kuhn, M. (2008). Building predictive models in r using the caret package. Journal of

Statistical Software, Articles 28(5), 1–26.



[13] Little, R. J. A. and D. B. Rubin (1986). Statistical Analysis with Missing Data. New York,

NY, USA: John Wiley & Sons, Inc.

[14] Palmer, S. and C. Jones (1966). A look at alternate imputation procedures for cps noninter-

views. Bureau of the Census, 66–459.

[15] Qin, Y., S. Zhang, X. Zhu, J. Zhang, and C. Zhang (2007). Semi-parametric optimization for

missing data imputation. Applied Intelligence 27(1), 79–88.

[16] R Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

[17] Schapire, R. E. (1999). A brief introduction to boosting. In Proceedings of the 16th Inter-

national Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, San Francisco, CA,

USA, pp. 1401–1406. Morgan Kaufmann Publishers Inc.

[18] Shao, J. (2000). Cold deck and ratio imputation. Survey Methodology 26(1), 79–85.

[19] Song, Y.-Y. and Y. Lu (2015). Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry 27(2), 130–135.

[20] Takahashi, M. (2017). Implementing multiple ratio imputation by the EMB algorithm. Journal of Modern Applied Statistical Methods 16(1), 657–673.

[21] Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley series in behavioral science.

Reading, MA, USA: Addison-Wesley.

[22] Wang, Y.-G. and Z. D. Bai (2007). Robust estimation using the huber function with a data-

dependent tuning constant. Journal of Computational and Graphical Statistics 16(2), 468–481.

FIGURE 1: Loss functions of the Huber and Tukey estimators

TABLE 1: Loss functions of the wls, Huber and Tukey estimators

$$L_w(e) = \frac{1}{2}\left(\frac{e}{\sigma\sqrt{x}}\right)^2$$

$$L_H(e) = \begin{cases} \frac{1}{2}e^2 & |e| \le k \\ k|e| - \frac{1}{2}k^2 & |e| > k \end{cases}$$

$$L_T(e) = \begin{cases} k^2\left\{1 - \left[1 - (e/k)^2\right]^3\right\}/6 & |e| \le k \\ k^2/6 & |e| > k \end{cases}$$

where $e = y - f(x)$ are the model residuals and $k$ is a predefined constant.

[Figure: example regression tree diagram with node depth levels 0, 1, 2 and 3]

FIGURE 2: Example of a regression tree where the maximum tree depth is 3

TABLE 2: Comparison of the ratio R based on different estimators, where Number of employees was used as predictor variable

                       R estimator
Target variable        wls       Huber     Tukey
Cost of purchases      37.86     37.65     37.29
Cost of employees      0.87      0.83      0.82
Depreciation cost      0.0070    0.0072    0.0076
Other costs            33.01     29.01     27.31

TABLE 3: Imputation error assessment with the predictor variable Number of employees, based on MAPE and RMSE

                       RMSE                                    MAPE
Target variable        wls        Huber      Tukey             wls      Huber    Tukey
Cost of purchases      36310.29   36800.83   37049.29          147.55   144.86   146.17
Cost of employees      643.72     647.90     676.31            4.42     4.56     4.36
Depreciation cost      222.23     228.90     231.60            1.39     1.40     1.42
Other costs            1493.98    1855.51    1838.57           7.83     8.08     8.01

[Figure: bar chart of risk reduction (scale 0 to 1) by variable, with selection frequencies: Cost of employees ≈ 0.47, Depreciation cost ≈ 0.19, Number of employees ≈ 0.17, Turnover ≈ 0.11, Cost of purchases ≈ 0.06]

FIGURE 3: Variable importance plot for the target variable Other costs (Dutch Structural Business dataset), reporting risk reduction and self-frequency values

[Figure: two panels (RMSE and MAPE) plotting imputation error against the predictor set (one predictor, two predictors, all variables) for six models: Boosted GAM, Boosted Regression Tree, Boosted Huber, Boosted Tukey, Huber and Tukey]

FIGURE 4: Accuracy results on the target variable Other costs (Dutch Structural Business dataset)

TABLE 4: Percentage accuracy improvement in the prediction of individual values brought by Tukey imputation

                       MAPE                                                      Accuracy
Variable               Classical ratio imputation^a   Tukey robust imputation^b  improvement (%)

Dutch Healthcare Dataset
Other costs            1.76       0.66      62.5%
Wages salaries         2.39       0.58      75.73%
Depreciation costs     0.41       0.36      12.20%
Total costs            4.45       0.64      85.62%
Pension costs          0.22       0.14      36.36%

Dutch Structural Business Dataset
Cost purchases         147.55     4.47      96.97%
Employees costs        4.42       3.39      23.30%
Depreciation costs     1.39       1.40      0%
Other costs            7.83       4.06      48.15%

^a Number of employees was used as predictor variable.
^b All variables were included in the model.

TABLE 5: Change in target variable means after imputation, and MAPE based on Tukey imputation. Bold values indicate the post-imputation mean closest to the real one.

                       Mean                                                                       MAPE
Variable               True (observed)   Classical ratio imputation^a   Tukey robust imputation^b

Dutch Healthcare Dataset
Other costs            542.30      537.31      544.26      0.36%
Wages cost             1162.70     1163.10     1169.02     0.54%
Depreciation costs     65.45       63.12       62.78       4.16%
Total costs            2115.48     2108.52     2108.58     0.33%
Pension costs          91.29       92.16       91.78       0.53%

Dutch Structural Business Dataset
Cost purchases         23764.00    23008.86    23757.62    0.03%
Employees costs        2132.00     2125.07     2114.62     0.82%
Depreciation costs     254.41      249.47      247.22      2.87%
Other costs            1994.00     1993.07     1982.88     0.56%

^a Number of employees was used as predictor variable.
^b All variables were included in the model.
