
Deep Neural Networks and Tabular Data: A Survey


Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug,
Martin Pawelczyk and Gjergji Kasneci

All authors are with the Data Science and Analytics Research (DSAR) group at the University of Tübingen, 72070 Tübingen, Germany. Gjergji Kasneci is also affiliated with Schufa Holding AG, 65201 Wiesbaden, Germany.
Corresponding authors: [email protected], [email protected]
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract—Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains highly challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data, and we also provide an overview of strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas, while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with eleven deep learning approaches across five popular real-world tabular data sets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.

Index Terms—Deep neural networks, Tabular data, Heterogeneous data, Discrete data, Tabular data generation, Probabilistic modeling, Interpretability, Benchmark, Survey

I. INTRODUCTION

Ever-increasing computational resources and the availability of large, labelled data sets have accelerated the success of deep neural networks [1], [2]. In particular, architectures based on convolutions, recurrent mechanisms [3], or transformers [4] have led to unprecedented performance in a multitude of domains. Although deep learning methods perform outstandingly well for classification or data generation tasks on homogeneous data (e.g., image, audio, and text data), tabular data still pose a challenge to deep learning models [5]–[8]. Tabular data – in contrast to image or language data – are heterogeneous, leading to dense numerical and sparse categorical features. Furthermore, the correlation among the features is weaker than the one introduced through spatial or semantic relationships in image or speech data. Hence, it is necessary to discover and exploit relations without relying on spatial information [9]. Therefore, Kadra et al. called tabular data sets the last “unconquered castle” for deep neural network models [10].

Heterogeneous data are the most commonly used form of data [7], and they are ubiquitous in many crucial applications, such as medical diagnosis based on patient history [11]–[13], predictive analytics for financial applications (e.g., risk analysis, estimation of creditworthiness, the recommendation of investment strategies, and portfolio management) [14], click-through rate (CTR) prediction [15], user recommendation systems [16], customer churn prediction [17], [18], cybersecurity [19], fraud detection [20], identity protection [21], psychology [22], delay estimations [23], anomaly detection [24], and so forth. In all these applications, a boost in predictive performance and robustness may have considerable benefits for both end users and companies that provide such solutions. Simultaneously, this requires handling many data-related pitfalls, such as noise, impreciseness, different attribute types and value ranges, or the missing value problem and privacy issues.

Meanwhile, deep neural networks offer multiple advantages over traditional machine learning methods. First, these methods are highly flexible [25], allow for efficient and iterative training, and are particularly valuable for AutoML [26]–[31]. Second, tabular data generation is possible using deep neural networks and can, for instance, help mitigate class imbalance problems [32]. Third, neural networks can be deployed for multimodal learning problems where tabular data can be one of many input modalities [28], [33]–[36], for tabular data distillation [37], [38], for federated learning [39], and in many more scenarios.

Successful deployments of data-driven applications require solving several tasks, among which we identified three core challenges: (1) inference, (2) data generation, and (3) interpretability. The most crucial task is inference, which is concerned with making predictions based on past observations. While a powerful predictive model is critical for all the applications mentioned in the previous paragraph, the interplay between tabular data and deep neural networks goes beyond simple inference tasks. Before a predictive model can even be trained, the training data usually needs to be preprocessed. This is where data generation plays a crucial role, as one of the standard deployment steps involves the imputation of missing values [40]–[42] and the rebalancing of the data set [43], [44] (i.e., equalizing sample sizes for different classes). Furthermore, it might be simply impossible to use
the actual data due to privacy concerns, e.g., in financial or medical applications [45], [46]. Thus, to tackle the data preprocessing and privacy challenges, probabilistic tabular data generation is essential. Finally, with stricter data protection laws such as the California Consumer Privacy Act (CCPA) [47] and the European General Data Protection Regulation (EU GDPR) [48], which both mandate a right to explanations for automated decision systems (e.g., in the form of recourse [49]), interpretability is becoming a key aspect for predictive models used for tabular data [50], [51]. During deployment, interpretability methods also serve as a valuable tool for model debugging and auditing [52].

Evidently, apart from the core challenges of inference, generation, and interpretability, there are several other important subfields, such as working with data streams, distribution shifts, as well as privacy and fairness considerations that should not be neglected. Nevertheless, to navigate the vast body of literature, we focus on the identified core problems and thoroughly review the state of the art in this work. We will briefly discuss the remaining topics at the end of this survey.

Beyond reviewing current literature, we think that an exhaustive comparison between existing deep learning approaches for heterogeneous tabular data is necessary to put reported results into context. The variety of benchmarking data sets and the different setups often prevent comparison of results across papers. Additionally, important aspects of deep learning models, such as training and inference time, model size, and interpretability, are usually not discussed. We aim to bridge this gap by providing a comparison of the surveyed inference approaches with classical – yet very strong – baselines such as XGBoost [53]. We open-source our code, allowing researchers to reproduce and extend our findings.

In summary, the aims of this survey are to provide:
1) a thorough review of existing scientific literature on deep learning for tabular data;
2) a taxonomic categorization of the available approaches for classification and regression tasks on heterogeneous tabular data;
3) a presentation of the state of the art and promising paths towards tabular data generation;
4) an overview of existing explanation approaches for deep models for tabular data;
5) an extensive empirical comparison of traditional machine learning methods and deep learning models on multiple real-world heterogeneous tabular data sets;
6) a discussion on the main reasons for the limited success of deep learning on tabular data;
7) a list of open challenges related to deep learning for tabular data.

Accordingly, this survey is structured as follows: We discuss related works in Section II. To introduce the reader to the field, in Section III, we provide definitions of the key terms, a brief outline of the domain's history, and propose a unified taxonomy of current approaches to deep learning with tabular data. Section IV covers the main methods for modelling tabular data using deep neural networks. Section V presents an overview on tabular data generation using deep neural networks. An overview of explanation mechanisms for deep models for tabular data is presented in Section VI. In Section VII, we provide an extensive empirical comparison of machine and deep learning methods on real-world data that also involves model size, runtime, and interpretability. In Section VIII, we summarize the state of the field and give future perspectives. Finally, we outline several open research questions before concluding in Section IX.

CONTENTS

I Introduction
II Related Work
III Tabular Data and Deep Neural Networks
    III-A Definitions
    III-B A Brief History of Deep Learning on Tabular Data
    III-C Challenges of Learning With Tabular Data
    III-D Unified Taxonomy
IV Deep Neural Networks for Tabular Data
    IV-A Data Transformation Methods
        IV-A1 Single-Dimensional Encoding
        IV-A2 Multi-Dimensional Encoding
    IV-B Specialized Architectures
        IV-B1 Hybrid Models
        IV-B2 Transformer-based Models
    IV-C Regularization Models
V Tabular Data Generation
    V-A Methods
    V-B Assessing Generative Quality
VI Explanation Mechanisms for Deep Learning with Tabular Data
    VI-A Feature Highlighting Explanations
    VI-B Counterfactual Explanations
VII Experiments
    VII-A Data Sets
    VII-B Open Performance Benchmark on Tabular Data
        VII-B1 Hyperparameter Selection
        VII-B2 Data Preprocessing
        VII-B3 Reproducibility and Extensibility
        VII-B4 Results
    VII-C Run Time Comparison
    VII-D Interpretability Assessment
VIII Discussion and Future Prospects
    VIII-A Summary and Trends
    VIII-B Open Research Questions
IX Conclusion
References

II. RELATED WORK

To the best of our knowledge, there is no study dedicated exclusively to the application of deep neural networks to tabular data, spanning the areas of supervised and unsupervised learning, data synthesis, and interpretability. Prior works cover some of these aspects, but none of them systematically discusses the existing approaches in the broadness of this survey.

However, there are some works that cover parts of the domain. There is a comprehensive analysis of common approaches for categorical data encoding as a preprocessing step for deep neural networks by Hancock & Khoshgoftaar [54]. The authors compared existing methods for categorical data encoding on various tabular data sets and different deep learning architectures. We also discuss the key categorical data encoding methods in Section IV-A1.

A recent survey by Sahakyan et al. [50] summarizes explanation techniques in the context of tabular data. Hence, we do not provide a detailed discussion of explainable machine learning for tabular data in this paper. However, for the sake of completeness, we present some of the most relevant works in Section VI and highlight open challenges in this area.

Gorishniy et al. [55] empirically evaluated a large number of state-of-the-art deep learning approaches for tabular data on a wide range of data sets. The authors demonstrated that a tuned deep neural network model with a ResNet-like architecture [56] shows comparable performance to some state-of-the-art deep learning approaches for tabular data.

Recently, Shwartz-Ziv & Armon [7] published a study on several different deep models for tabular data including TabNet [5], NODE [6], Net-DNF [57]. Additionally, they compared deep learning approaches to gradient boosting decision tree algorithms regarding accuracy, training effort, inference efficiency, and hyperparameter optimization time. They observed that deep models had the best results on their chosen data sets, however, not one single deep model could outperform all the others in general. The deep models were challenged by gradient boosting decision trees, leading the authors to conclude that efficient tabular data modelling using deep neural networks is still an open research problem. In the face of this evidence, we aim to integrate the necessary background for future research on the inference problem and on the intertwined challenges of generation and explainability into a single work.

Fig. 1: Unified taxonomy of deep neural network models for heterogeneous tabular data. (The figure shows a tree with Deep Learning for Tabular Data at the root and three branches: Data Transformation Methods, with Single-Dimensional Encoding and Multi-Dimensional Encoding; Specialized Architectures, with Hybrid Models (fully differentiable and partly differentiable) and Transformer-based Models; and Regularization Models.)

III. TABULAR DATA AND DEEP NEURAL NETWORKS

A. Definitions

In this section, we give definitions for central terms used in this work. We also provide pointers to the original works for more detailed explanations of the methods.

The key concept in this survey is a (deep) neural network. Unless stated otherwise we use this concept as a synonym for feed-forward networks, as described by [2], and name the concrete model whenever we deviate from this concept. A deep neural network defines a mapping f̂,

    y = f(x) ≈ f̂(x; W),    (1)

that learns the value of the model parameters W (i.e., the "weights" of a neural network) that results in the best approximation of the true underlying and unknown function f. In this case, x is a multi-dimensional data sample (i.e., x ∈ R^n) with corresponding target y (where typically, y ∈ R^k for k classes and y ∈ R for regression tasks) from a data set of tuples {(x_i, y_i)}_{i ∈ I}. The network is called feed-forward if the input information flows in one direction to the output without any feedback connections.
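To make this definition concrete, the following minimal sketch (our own illustration rather than code from any surveyed work; the layer sizes, names, and the use of PyTorch are arbitrary choices) instantiates such a feed-forward mapping f̂(x; W) for a tabular input that mixes numerical features with embedded categorical features:

```python
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    """Feed-forward mapping f_hat(x; W) for a tabular sample that mixes
    numerical features with (embedded) categorical features."""
    def __init__(self, n_numeric, cat_cardinalities, emb_dim=8, n_classes=2):
        super().__init__()
        # one embedding table per categorical feature
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cat_cardinalities)
        in_dim = n_numeric + emb_dim * len(cat_cardinalities)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_classes))  # k logits; a single unit would be used for regression

    def forward(self, x_num, x_cat):
        # x_num: (batch, n_numeric) float tensor, x_cat: (batch, n_cat) integer tensor
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat([x_num] + embedded, dim=-1)
        return self.mlp(x)

# toy usage: three numerical features, two categorical features with 10 and 4 levels
model = TabularMLP(n_numeric=3, cat_cardinalities=[10, 4])
logits = model(torch.randn(5, 3), torch.randint(0, 4, (5, 2)))
```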
Throughout this survey we focus on heterogeneous data, which usually contains a variety of attribute types. These include both continuous and discrete attributes of different type (e.g., binary values, ordinal values, high-cardinality categorical values). This is fundamentally different from homogeneous data modalities, such as images, audio, or text data, where only a single feature type is present.

Categorical variables are an attribute type of particular importance. According to Lane's definition [58], categorical variables are qualitative values. They "do not imply a numerical ordering", unlike quantitative values, which are "measured in terms of numbers". Usually, a categorical variable can take one out of a limited set of values. Examples of typical categorical variables include gender, user id, product type, and topic.

Tabular data, sometimes also called structured data [59], is a subcategory of the heterogeneous data format that is usually presented in a table [60] with data points as rows and features as columns. In summary, for the scope of this work, we refer
to a data set with a fixed number of features that are either continuous or categorical as tabular. Each data point can be understood as a row in the table, or – taking a probabilistic view – as a sample from the unknown joint distribution. An illustrative example of five rows of a heterogeneous tabular data set is provided in Table I.

Age | Education | Occupation        | Sex    | Income
39  | Bachelors | Adm-clerical      | Male   | ≤50K
50  | Bachelors | Exec-managerial   | Male   | >50K
38  | HS-grad   | Handlers-cleaners | Male   | ≤50K
53  | 11th      | Handlers-cleaners | Male   | ≤50K
28  | Bachelors | Prof-specialty    | Female | >50K

TABLE I: An example of a heterogeneous tabular data set. Here we show five samples with selected variables from the Adult data set [61]. Section VII-A provides further details on this data set.

B. A Brief History of Deep Learning on Tabular Data

Tabular data are one of the oldest forms of data to be statistically analysed. Before digital collection of text, images, and sound was possible, almost all data were tabular [62]–[64]. Therefore, they were the target of early machine learning research [65]. However, deep neural networks became popular in the digital age and were further developed with a focus on homogeneous data. In recent years, various supervised, self-supervised, and semi-supervised deep learning approaches have been proposed that explicitly address the issue of tabular data modelling again. Early works mostly focused on data transformation techniques for preprocessing [66]–[68], which are still important today [54].

A huge stimulus was the rise of e-commerce, which demanded novel solutions, especially in advertising [15], [69]. These tasks required fast and accurate estimation on heterogeneous data sets with many categorical variables, for which the traditional machine learning approaches are not well suited (e.g., categorical features that have high cardinality can lead to very sparse high-dimensional feature vectors and non-robust models). As a result, researchers and data scientists started looking for more flexible solutions, e.g., those based on deep neural networks, that can capture complex non-linear dependencies in the data.

In particular, the click-through rate prediction problem has received a lot of attention [15], [70], [71]. A large variety of approaches were proposed, most of them relying on specialized neural network architectures for heterogeneous tabular data.

A more recent line of research, sparked by Shavitt & Segal [72], evolved based on the idea that regularization may improve the performance of deep neural networks on tabular data [10]. This has led to an intensification of research on regularization approaches.

Due to the tremendous success of attention-based approaches such as transformers on textual [73] and visual data [74], [75], researchers have recently also started applying attention-based methods and self-supervised learning techniques to tabular data. After the introduction of transformer architectures to the field of tabular data [5], a lot of research effort has focused on transformer architectures that can be successfully applied to very large tabular data sets.

C. Challenges of Learning With Tabular Data

As we have mentioned in Section II, deep neural networks often perform less favourably compared to more traditional machine learning methods (e.g., tree-based methods) when dealing with tabular data. However, it is often unclear why deep learning cannot achieve the same level of predictive quality as in other domains such as image classification and natural language processing. In the following, we identify and discuss four possible reasons:

1) Low-Quality Training Data: Data quality is a common issue with real-world tabular data sets. They often include missing values [40], extreme data (outliers) [24], erroneous or inconsistent data [76], and have small overall size relative to the high-dimensional feature vectors generated from the data [77]. Also, due to the expensive nature of data collection, tabular data are frequently class-imbalanced. These challenges affect all machine learning algorithms; however, most of the modern decision tree-based algorithms can handle missing values or different/extreme variable ranges internally by looking for appropriate approximations and split values [53], [78], [79].

2) Missing or Complex Irregular Spatial Dependencies: There is often no spatial correlation between the variables in tabular data sets [80], or the dependencies between features are rather complex and irregular. When working with tabular data, the structure and relationships between its features have to be learned from scratch. Thus, the inductive biases used in popular models for homogeneous data, such as convolutional neural networks, are unsuitable for modelling this data type [57], [81], [82].

3) Dependency on Preprocessing: A key advantage of deep learning on homogeneous data is that it includes an implicit representation learning step [83], so only a minimal amount of preprocessing or explicit feature construction is required. However, for tabular data and deep neural networks the performance may strongly depend on the selected preprocessing strategy [84]. Handling the categorical features remains particularly challenging [54] and can easily lead to a very sparse feature matrix (e.g., by using a one-hot encoding scheme) or introduce a synthetic ordering of previously unordered values (e.g., by using an ordinal encoding scheme). Lastly, preprocessing methods for deep neural networks may lead to information loss, leading to a reduction in predictive performance [85].

4) Importance of Single Features: While typically changing the class of an image requires a coordinated change in many features, i.e., pixels, the smallest possible change of a categorical (or binary) feature can entirely flip a prediction on tabular data [72]. In contrast to deep neural networks, decision-tree algorithms can handle varying feature importance exceptionally well by selecting a
single feature and appropriate threshold (i.e., splitting) values and "ignoring" the rest of the data sample. Shavitt & Segal [72] have argued that individual weight regularization may mitigate this challenge and motivated more work in this direction [10].

With these four fundamental challenges in mind, we continue by organizing and discussing the strategies developed to address them. We start by developing a suitable taxonomy.

D. Unified Taxonomy

In this section, we introduce a taxonomy of approaches that allows for a unified view of the field. We divide the works from the deep learning with tabular data literature into three main categories: data transformation methods, specialized architectures, and regularization models. In Fig. 1, we provide an overview of our taxonomy of deep learning methods for tabular data.

Data transformation methods. The methods in the first group transform categorical and numerical data. This is usually done to enable deep neural network models to better extract the information signal. Methods from this group do not require new architectures or adaptations of the existing data processing pipeline. Nevertheless, the transformation step comes at the cost of an increased preprocessing time. This might be an issue for high-load systems [86], particularly in the presence of categorical variables with high cardinality and growing data set size. We can further subdivide this area into Single-Dimensional Encodings and Multi-Dimensional Encodings. The former encodings are employed to transform each feature independently, while the latter encoding methods map an entire record to another representation.

Specialized architectures. The biggest share of works investigates specialized architectures and suggests that a different deep neural network architecture is required for tabular data. Two types of architectures are of particular importance: hybrid models fuse classical machine learning approaches (e.g., decision trees) with neural networks, while transformer-based models rely on attention mechanisms.

Regularization models. Lastly, the group of regularization models claims that one of the main reasons for the moderate performance of deep learning models on tabular data is their extreme non-linearity and model complexity. Therefore, strong regularization schemes are proposed as a solution. They are mainly implemented in the form of special-purpose loss functions.

We believe our taxonomy may help practitioners find the methods of choice that can be easily integrated into their existing tool chain. For instance, applying data transformations can result in performance improvements while maintaining the current model architecture. Conversely, using specialized architectures, the data preprocessing pipeline can be kept intact.

IV. DEEP NEURAL NETWORKS FOR TABULAR DATA

In this section, we discuss the use of deep neural networks on tabular data for classification and regression tasks according to the taxonomy presented in the previous section. We provide an overview of existing deep learning approaches in this area of research in Table II and examine the three methodological categories in detail: data transformation methods (see Subsection IV-A), architecture-based methods (see Subsection IV-B), and regularization-based models (see Subsection IV-C).

A. Data Transformation Methods

Most traditional approaches for deep neural networks on tabular data fall into this group. Interestingly, data preprocessing plays a relatively minor role in computer vision, even though the field is currently dominated by deep learning solutions [2]. There are many different possibilities to transform tabular data, and each may have a different impact on the learning results [54].

1) Single-Dimensional Encoding: One of the critical obstacles for deep learning with tabular data is the presence of categorical variables. Since neural networks only accept real number vectors as inputs, these values must be transformed before a model can use them. Therefore, the first class of methods attempts to encode categorical variables in a way suitable for deep learning models.

Approaches in this group [54] are divided into deterministic techniques, which can be used before training the model, and more complicated automatic techniques that are part of the model architecture. There are many ways for deterministic data encoding; hence we restrict ourselves to the most common ones without the claim of completeness.

The simplest data encoding technique might be ordinal or label encoding. Every category is just mapped to a discrete numeric value, e.g., {Apple, Banana} are encoded as {0, 1}. One drawback of this method may be that it introduces an artificial order to previously unordered categories. Another straightforward method that does not induce any order is the one-hot encoding. One additional column for each unique category is added to the data. Only the column corresponding to the observed category is assigned the value one, with the other values being zero. In our example, Apple could be encoded as (1,0) and Banana as (0,1). In the presence of a diverse set of categories in the data, this method can lead to high-dimensional sparse feature vectors and exacerbate the "curse of dimensionality" problem.

Binary encoding limits the number of new columns by transforming the qualitative data into a numerical representation (as the label encoding does) and using the binary format of the number. Again, the digits are split into different columns, but only about ⌈log₂(c)⌉ new columns are needed if c is the number of unique categorical values. If we extend our example to three fruits, e.g., {Apple, Banana, Pear}, we only need two columns to represent them: (01), (10), (11).

One approach that needs no extra columns and does not include any artificial order is the so-called leave-one-out encoding. It is based on the target encoding technique proposed in the work by [101], where every category is replaced with the mean of the target variable of that category. The leave-one-out encoding excludes the current row when computing the mean of the target variable to avoid overfitting. This approach is also used in the CatBoost framework [79], a state-of-the-art machine learning library for heterogeneous tabular data based on the gradient boosting algorithm [102].
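As a concrete illustration of the deterministic encodings discussed above, the short sketch below (our own example; the data, column names, and the use of pandas are arbitrary) computes ordinal, one-hot, binary, and leave-one-out encodings for a toy column:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["Apple", "Banana", "Pear", "Apple", "Banana"],
                   "target": [1, 0, 1, 0, 1]})

# ordinal / label encoding: every category is mapped to an integer
df["fruit_ordinal"] = df["fruit"].astype("category").cat.codes

# one-hot encoding: one indicator column per unique category
one_hot = pd.get_dummies(df["fruit"], prefix="fruit")

# binary encoding: write the (1-based) ordinal code in base 2, one column per bit
codes = df["fruit_ordinal"] + 1
n_bits = int(codes.max()).bit_length()
binary = pd.DataFrame({f"fruit_bit{b}": (codes // (2 ** b)) % 2 for b in range(n_bits)})

# leave-one-out (target) encoding: mean target of the category, excluding the current
# row; categories that occur only once fall back to the global target mean
grp = df.groupby("fruit")["target"]
sums, counts = grp.transform("sum"), grp.transform("count")
df["fruit_loo"] = ((sums - df["target"]) / (counts - 1)).where(counts > 1,
                                                               df["target"].mean())

print(pd.concat([df, one_hot, binary], axis=1))
```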

A different strategy is hash-based encoding. Every category is transformed into a fixed-size value via a deterministic hash function. The output size is not directly dependent on the number of input categories but can be chosen manually.

Method | Interpretability | Key Characteristics

Encoding
SuperTML [87] | | Transform tabular data into images for CNNs
VIME [88] | | Self-supervised learning and contextual embedding
IGTD [80] | | Transform tabular data into images for CNNs
SCARF [89] | | Self-supervised contrastive learning

Architectures, Hybrid
Wide&Deep [90] | | Embedding layer for categorical features
DeepFM [15] | | Factorization machine for categorical data
SDT [91] | ✓ | Distill neural network into interpretable decision tree
xDeepFM [92] | | Compressed interaction network
TabNN [93] | | DNNs based on feature groups distilled from GBDT
DeepGBM [70] | | Two DNNs, distill knowledge from decision tree
NODE [6] | | Differentiable oblivious decision tree ensemble
NON [94] | | Network-on-network model
DNN2LR [95] | | Calculate cross feature fields with DNNs for LR
Net-DNF [57] | | Structure based on disjunctive normal form
Boost-GNN [96] | | GNN on top of decision trees from the GBDT algorithm
SDTR [97] | | Hierarchical differentiable neural regression model

Architectures, Transformer
TabNet [5] | ✓ | Sequential attention structure
TabTransformer [98] | ✓ | Transformer network for categorical data
SAINT [9] | ✓ | Attention over both rows and columns
ARM-Net [99] | | Adaptive relational modelling with multi-head gated attention network
Non-Param. Transformer [100] | | Process the entire dataset at once, use attention between data points

Regularization
RLN [72] | ✓ | Hyperparameter regularization scheme
Regularized DNNs [10] | | A "cocktail" of regularization techniques

TABLE II: Overview of deep learning approaches for tabular data. We organize them in categories ordered chronologically inside the groups. The "Interpretability" column indicates whether the approach offers some form of interpretability for the model's decisions. The key characteristics of every model are summarized in the last column.

2) Multi-Dimensional Encoding: A first automatic encoding strategy is the VIME approach [88]. The authors propose a self- and semi-supervised deep learning framework for tabular data that trains an encoder in a self-supervised fashion by using two pretext tasks. These tasks are independent of the concrete downstream task which the predictor has to solve. The first task of VIME is called mask vector estimation; its goal is to determine which values in a sample are corrupted. The second task, i.e., feature vector estimation, is to recover the original values of the sample. The encoder itself is a simple multilayer perceptron. This automatic encoding makes use of the fact that there is often much more unlabelled than labelled data. The encoder learns how to construct an informative homogeneous representation of the raw input data. In the semi-supervised step, a predictive model, which is also a deep neural network model, is trained using the labelled and unlabelled data transformed by the encoder. For the encoder, a novel data augmentation method is used, corrupting an unlabelled data point multiple times with different masks. On the predictions from all augmented samples from one original data point, a consistency loss L_u can be computed that rewards similar outputs. Combined with a supervised loss L_s from the labelled data, the predictive model minimizes the final loss L = L_s + β · L_u. To summarize, the VIME network trains an encoder, which is responsible for transforming the categorical and numerical features into a new homogeneous and informative representation. This transformed feature vector is used as an input to the predictive model. For the encoder itself, the categorical data can be transformed by a simple one-hot encoding and binary encoding.
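The following is a minimal sketch of this semi-supervised objective (our own simplification; the encoder, predictor, the mask-corruption routine corrupt, and the weight beta are placeholders and not the original VIME implementation):

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(encoder, predictor, x_lab, y_lab, x_unlab, corrupt,
                         n_aug=3, beta=1.0):
    """L = L_s + beta * L_u: a supervised loss on labelled data plus a consistency
    loss that rewards similar predictions for differently corrupted views of the
    same unlabelled data points."""
    # supervised part on the encoded labelled batch
    l_sup = F.cross_entropy(predictor(encoder(x_lab)), y_lab)

    # consistency part: predictions over several mask-corrupted views should agree,
    # i.e., have low variance across the augmentations
    preds = torch.stack([predictor(encoder(corrupt(x_unlab))) for _ in range(n_aug)])
    l_cons = preds.var(dim=0).mean()

    return l_sup + beta * l_cons
```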
Another stream of research aims at transforming the tabular input into a more homogeneous format. Since the revival of deep learning, convolutional neural networks have shown tremendous success in computer vision tasks. Therefore, the work by [87] proposed the SuperTML method, which is a data conversion technique to transform tabular data into an image data format (2-d matrices), i.e., black-and-white images.

The image generator for tabular data (IGTD) by [80] follows an idea similar to SuperTML. The IGTD framework converts tabular data into images to make use of classical convolutional architectures. As convolutional neural networks rely on spatial dependencies, the transformation into images is optimized by minimizing the difference between the feature distance
ranking of the tabular data and the pixel distance ranking of the generated image. Every feature corresponds to one pixel, which leads to compact images with similar features close at neighbouring pixels. Thus, IGTD can be used in the absence of domain knowledge. The authors show relatively solid results for data with strong feature relationships, but the method may fail if the features are independent or feature similarities can not characterize the relationships. In their experiments, the authors used only gene expression profiles and molecular descriptors of drugs as data. This kind of data may lead to a favourable inductive bias, so the general viability of the approach remains unclear.

B. Specialized Architectures

Specialized architectures form the largest group of approaches for deep tabular data learning. Hence, in this group, the focus is on the development and investigation of novel deep neural network architectures designed specifically for heterogeneous tabular data. Guided by the types of available models, we divide this group into two sub-groups: hybrid models (presented in IV-B1) and transformer-based models (discussed in IV-B2).

1) Hybrid Models: Most approaches for deep neural networks on tabular data are hybrid models. They transform the data and fuse successful classical machine learning approaches, often decision trees, with neural networks. We distinguish between fully differentiable models, which can be differentiated with respect to all their parameters, and partly differentiable models.

Fully differentiable Models. The fully differentiable models in this category offer a valuable property: they permit end-to-end deep learning for training and inference by means of gradient descent optimizers. Thus, they allow for highly efficient implementations in modern deep learning frameworks that exploit GPU or TPU acceleration throughout the code.

Popov et al. [6] propose an ensemble of differentiable oblivious decision trees [103] – also known as the NODE framework for deep learning on tabular data. Oblivious decision trees use the same splitting function for all nodes on the same level and can therefore be easily parallelized. NODE is inspired by the successful CatBoost [79] framework. To make the whole architecture fully differentiable and benefit from end-to-end optimization, NODE utilizes the entmax transformation [104] and soft splits. In the original experiments, the NODE framework outperforms XGBoost and other GBDT models on many data sets. As NODE is based on decision tree ensembles, there is no preprocessing or transformation of the categorical data necessary. Decision trees are known to handle discrete features well. In the official implementation, strings are converted to integers using the leave-one-out encoding scheme. The NODE framework is widely used and provides a sound implementation that can be readily deployed.
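To illustrate the idea of soft splits, the sketch below (our own simplified illustration, not the NODE implementation; NODE additionally uses the entmax transformation for sparse feature and threshold selection, among other details) shows a single differentiable oblivious tree in which all nodes of a level share one soft decision:

```python
import torch
import torch.nn as nn

class SoftObliviousTree(nn.Module):
    """A depth-d oblivious tree with soft splits: the same d feature/threshold
    decisions are shared by all nodes of a level, and sigmoid gates replace hard
    comparisons so the whole structure is differentiable end to end."""
    def __init__(self, n_features, depth=3, n_out=1):
        super().__init__()
        self.feature_logits = nn.Parameter(torch.zeros(depth, n_features))  # soft feature choice per level
        self.thresholds = nn.Parameter(torch.zeros(depth))
        self.leaf_values = nn.Parameter(torch.zeros(2 ** depth, n_out))

    def forward(self, x):
        # soft feature selection per level, then a soft "go right" probability
        feats = x @ torch.softmax(self.feature_logits, dim=-1).t()   # (batch, depth)
        right = torch.sigmoid(feats - self.thresholds)               # (batch, depth)
        # probability of reaching each of the 2^depth leaves
        probs = torch.ones(x.shape[0], 1, device=x.device)
        for d in range(right.shape[1]):
            go_right = right[:, d:d + 1]
            probs = torch.cat([probs * (1 - go_right), probs * go_right], dim=1)
        # prediction is the probability-weighted mixture of leaf values
        return probs @ self.leaf_values
```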
Frosst & Hinton [91] contribute another model relying on soft decision trees (SDT) to make neural networks more interpretable. They investigated training a deep neural network first, before using a mixture of its outputs and the ground truth labels to train the SDT model in a second step. This also allows for semi-supervised learning with unlabelled samples that are labelled by the deep neural network and used to train a more robust decision tree along with the labelled data. The authors showed that training a neural model first increases accuracy over SDTs that are directly learned from the data. However, their distilled trees still exhibit a performance gap to the neural networks that were fitted in the initial step. Nevertheless, the model itself shows a clear relationship among different classes in a hierarchical fashion. It groups different categorical values based on the common patterns, e.g., the digits 8 and 9 from the MNIST data set [105]. To summarize, the proposed method allows for high interpretability and efficient inference, at the cost of slightly reduced accuracy.

Follow-up work [97] extends this line of research to heterogeneous tabular data and regression tasks and presents the soft decision tree regressor (SDTR) framework. The SDTR is a neural network which imitates a binary decision tree. Therefore, all neurons, like nodes in a tree, get the same input from the data instead of the output from previous layers. In the case of deep networks, the SDTR could not beat other state-of-the-art models, but it has shown promising results in a low-memory setting, where single tree models and shallow architectures were compared.

Katzir et al. [57] follow a related idea. Their Net-DNF builds on the observation that every decision tree is merely a form of a Boolean formula, more precisely a disjunctive normal form. They use this inductive bias to design the architecture of a neural network, which is able to imitate the characteristics of the gradient boosting decision trees algorithm. The resulting Net-DNF was tested for classification tasks on data sets with no missing values, where it showed results that are comparable to those of XGBoost [53]. However, the authors did not mention how to handle high-cardinality categorical data, as the used data sets contained mostly numerical and few binary features.

Linear models (e.g., linear and logistic regression) provide global interpretability but are inferior to complex deep neural networks. Usually, handcrafted feature engineering is required to improve the accuracy of linear models. Liu et al. [95] use a deep neural network to combine the features in a possibly non-linear way; the resulting combination then serves as input to the linear model. This enhances the simple model while still providing interpretability.

The work by Cheng et al. [90] proposes a hybrid architecture that consists of linear and deep neural network models – Wide&Deep. A linear model that takes single features and a wide selection of hand-crafted logical expressions on features as an input is enhanced by a deep neural network to improve the generalization capabilities. Additionally, Wide&Deep learns an n-dimensional embedding vector for each categorical feature. All embeddings are concatenated, resulting in a dense vector used as input to the neural network. The final prediction can be understood as a sum of both models. A similar work by Guo et al. [106] proposes an embedding using deep neural networks for categorical variables.

Another contribution to the realm of Wide&Deep models is DeepFM [15]. The authors demonstrate that it is possible to replace the hand-crafted feature transformations with learned Factorization Machines (FMs) [107], leading to an improvement
of the overall performance. The FM is an extension of a linear model designed to capture interactions between features within high-dimensional and sparse data efficiently. Similar to the original Wide&Deep model, DeepFM also relies on the same embedding vectors for its "wide" and "deep" parts. In contrast to the original Wide&Deep model, however, DeepFM alleviates the need for manual feature engineering.

Lastly, Network-on-Network (NON) [94] is a classification model for tabular data, which focuses on capturing the intra-feature information efficiently. It consists of three components: a field-wise network consisting of one unique deep neural network for every column to capture the column-specific information, an across-field network, which chooses the optimal operations based on the data set, and an operation fusion network, connecting the chosen operations and allowing for non-linearities. As the optimal operations for the specific data are selected, the performance is considerably better than that of other deep learning models. However, the authors did not include decision trees in their baselines, the current state-of-the-art models on tabular data. Also, training as many neural networks as there are columns and selecting the operations on the fly may lead to a long computation time.

Partly differentiable Models. This subgroup of hybrid models aims at combining non-differentiable approaches with deep neural networks. Models from this group usually utilize decision trees for the non-differentiable part.

The DeepGBM model [70] combines the flexibility of deep neural networks with the preprocessing capabilities of gradient boosting decision trees. DeepGBM consists of two neural networks – CatNN and GBDT2NN. While CatNN is specialized to handle sparse categorical features, GBDT2NN is specialized to deal with dense numerical features.

In the preprocessing step for the CatNN network, the categorical data are transformed via an ordinal encoding (to convert the potential strings into integers), and the numerical features are discretized, as this network is specialized for categorical data. The GBDT2NN network distills the knowledge about the underlying data set from a model based on gradient boosting decision trees by accessing the leaf indices of the decision trees. This embedding based on decision tree leaves was first proposed by [108] for the random forest algorithm. Later, the same knowledge distillation strategy was adopted for gradient boosting decision trees [109].

Using the proposed combination of two deep neural networks, DeepGBM has a strong learning capacity for both categorical and numerical features. Distinctively, the authors implemented and tested DeepGBM's online prediction performance, which is significantly higher than that of gradient boosting decision trees. On the downside, the leaf indices can be seen as meta categorical features since these numbers cannot be directly compared. Also, it is not clear how other data-related issues, such as missing values, different scaling of numeric features, and noise, influence the predictions produced by the models.

The TabNN architecture, introduced by [93], is based on two principles: explicitly leveraging expressive feature combinations and reducing model complexity. It distills the knowledge from gradient boosting decision trees to retrieve feature groups; it clusters them and then constructs the neural network based on those feature combinations. Also, structural knowledge from the trees is transferred to provide an effective initialization. However, the construction of the network already takes different extensive computation steps, of which one is only a heuristic to avoid an NP-hard problem. Overall, considering the construction challenges and that an implementation of TabNN was not provided, the practical use of the network seems limited.

In a similar spirit to DeepGBM and TabNN, the work from [96] proposes using gradient boosting decision trees for the data preprocessing step. The authors show that a decision tree structure has the form of a directed graph. Thus, the proposed framework exploits the topology information from the decision trees using graph neural networks [110]. The resulting architecture is coined Boosted Graph Neural Network (BGNN). In multiple experiments, BGNN demonstrates that the proposed architecture is superior to existing solid competitors in terms of predictive performance and training time.

2) Transformer-based Models: Transformer-based approaches form another subgroup of model-based deep neural methods for tabular data. Inspired by the recent surge of interest in transformer-based methods and their successes on text and visual data [75], [111], researchers and practitioners have proposed multiple approaches using deep attention mechanisms [4] for heterogeneous tabular data.

TabNet [5] is one of the first transformer-based models for tabular data. Like a decision tree, the TabNet architecture comprises multiple subnetworks that are processed in a sequential hierarchical manner. According to [5], each subnetwork corresponds to one decision step. To train TabNet, each decision step (subnetwork) receives the current data batch as input. TabNet aggregates the outputs of all decision steps to obtain the final prediction. At each decision step, TabNet first applies a sparse feature mask [112] to perform soft instance-wise feature selection. The authors claim that the feature selection can save valuable resources, as the network may focus on the most important features. The feature mask of a decision step is trained using attentive information from the previous decision step. To this end, a feature transformer module decides which features should be passed to the next decision step and which features should be used to obtain the output at the current decision step. Some layers of the feature transformers are shared across all decision steps. The obtained feature masks correspond to local feature weights and can also be combined into a global importance score. Accordingly, TabNet is one of the few deep neural networks that offers different levels of interpretability by design. Indeed, experiments show that each decision step of TabNet tends to focus on a particular subdomain of the learning problem (i.e., one particular subset of features). This behaviour is similar to convolutional neural networks. TabNet also provides a decoder module that is able to preprocess input data (e.g., replace missing values) in an unsupervised way. Accordingly, TabNet can be used in a two-stage self-supervised learning procedure, which improves the overall predictive quality. One of the popular Python [113] frameworks for tabular data provides an efficient implementation of TabNet [114]. Recently, TabNet has also been investigated in the context of fair machine learning [115], [116].
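The following sketch (our own simplification, not the original TabNet code; it replaces the sparsemax/entmax normalization with a softmax and omits the shared feature-transformer layers and the sparsity regularizer) illustrates the sequential decision steps with instance-wise feature masks described above:

```python
import torch
import torch.nn as nn

class DecisionStep(nn.Module):
    """One TabNet-style decision step: learn an instance-wise feature mask,
    apply it to the input, and produce a partial output representation."""
    def __init__(self, n_features, n_hidden):
        super().__init__()
        self.mask_net = nn.Linear(n_features, n_features)      # attentive transformer (simplified)
        self.feature_net = nn.Sequential(                      # feature transformer (simplified)
            nn.Linear(n_features, n_hidden), nn.ReLU())

    def forward(self, x, prior):
        # prior encodes how strongly each feature was already used in earlier steps
        mask = torch.softmax(self.mask_net(x) * prior, dim=-1)  # soft instance-wise feature selection
        out = self.feature_net(x * mask)
        new_prior = prior * (1.0 - mask)                        # discourage re-using the same features
        return out, mask, new_prior

class TabNetLike(nn.Module):
    """Aggregate the outputs of several sequential decision steps."""
    def __init__(self, n_features, n_hidden, n_steps, n_classes):
        super().__init__()
        self.steps = nn.ModuleList(
            DecisionStep(n_features, n_hidden) for _ in range(n_steps))
        self.head = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        prior = torch.ones_like(x)
        agg, masks = 0.0, []
        for step in self.steps:
            out, mask, prior = step(x, prior)
            agg = agg + out       # sum of decision-step outputs
            masks.append(mask)    # the masks can be inspected as local feature weights
        return self.head(agg), masks
```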

Attention-based architectures offer mechanisms for interpretability, which is an essential advantage over many hybrid models. Figure 2 shows attention maps of the TabNet model and the KernelSHAP explanation framework on the Adult data set [61].

Fig. 2: Interpretable learning with the TabNet [5] architecture. We compare the attributions provided by the model for a sample from the UCI Adult data set with those provided by the game theoretic KernelSHAP framework [123]. (Panel (a) shows TabNet attributions and panel (b) shows KernelSHAP attributions, one sample per line over the Adult features.)

Another supervised and semi-supervised approach is introduced by Huang et al. [98]. Their TabTransformer architecture uses self-attention-based transformers to map the categorical features to a contextual embedding. This embedding is more robust to missing or noisy data and enables interpretability. The embedded categorical features are then, together with the numerical ones, fed into a simple multilayer perceptron. If, in addition, there is an extra amount of unlabelled data, unsupervised pre-training can improve the results, using masked language modelling or replaced token detection. Extensive experiments show that TabTransformer matches the performance of tree-based ensemble techniques, showing success also when dealing with missing or noisy data. The TabTransformer network puts a significant focus on the categorical features. It transforms the embedding of those features into a contextual embedding which is then used as input for the multilayer perceptron. This embedding is implemented by different multi-head attention-based transformers, which are optimized during training.

ARM-net [99] is an adaptive neural network for relation modelling tailored to tabular data. The key idea of the ARM-net framework is to model feature interactions with combined features (feature crosses) selectively and dynamically by first transforming the input features into exponential space and then determining the interaction order and interaction weights adaptively for each feature cross. Furthermore, the authors propose a novel sparse attention mechanism to generate the interaction weights given the input data dynamically. Thus, users can explicitly model feature crosses of arbitrary orders with noisy features filtered selectively.

SAINT (Self-Attention and Intersample Attention Transformer) [9] is a hybrid attention approach, combining self-attention [4] with inter-sample attention over multiple rows. When handling missing or noisy data, this mechanism allows the model to borrow the corresponding information from similar samples, which improves the model's robustness. The technique is reminiscent of nearest-neighbour classification. In addition, all features are embedded into a combined dense latent vector, enhancing existing correlations between values from one data point. To exploit the presence of unlabelled data, a self-supervised contrastive pre-training can further improve the results, minimizing the distance between two views of the same sample and maximizing the distance between different ones. Like the VIME framework (Section IV-A1), SAINT uses CutMix [117] to augment samples in the input space and uses mixup [118] in the embedding space.

Finally, even some new learning paradigms are being proposed. For instance, the Non-Parametric Transformer (NPT) [100] does not construct a mapping from individual inputs to outputs but uses the entire data set at once. By using attention between data points, relations between arbitrary samples can be modelled and leveraged for classifying test samples.

C. Regularization Models

The third group of approaches argues that the extreme flexibility of deep learning models for tabular data is one of the main learning obstacles and that strong regularization of learned parameters may improve the overall performance.

One of the first methods in this category was the Regularization Learning Network (RLN) proposed by Shavitt & Segal [72], which uses a learned regularization scheme. The main idea is to apply trainable regularization coefficients to each single weight in a neural network, thereby lowering the sensitivity. To efficiently determine the corresponding coefficients, the authors propose a novel loss function termed "Counterfactual Loss". The regularization coefficients lead to a very sparse network, which also provides the importance of the remaining input features.

In their experiments, RLNs outperform deep neural networks and obtain results comparable to those of the gradient boosting decision trees algorithm, but the evaluation relies on a data set with mainly numerical data to compare the models. The RLN paper does not address the issues of categorical data. For the experiments and the example implementation, data sets with exclusively numerical data (except for the gender attribute) were used. A similar idea is proposed in [119], where regularization coefficients are learned only in the first layer with the goal to extract feature importance.

Kadra et al. [10] state that simple multilayer perceptrons can outperform state-of-the-art algorithms on tabular data if deep learning networks are properly regularized. The authors propose a "cocktail" of regularization with thirteen different techniques that are applied jointly. From those, the optimal subset and their subsidiary hyperparameters are selected. They demonstrate in extensive experiments that the "cocktail" regularization can not only improve the performance of multilayer perceptrons, but these simple models also outperform tree-based architectures. On the downside, the extensive per-data-set regularization and hyperparameter optimization take much more computation time than the gradient boosting decision trees algorithm.

There are several other works [120]–[122] showing that strong regularization of deep neural networks can be beneficial for tabular data.

V. TABULAR DATA GENERATION

For many applications, the generation of realistic tabular data is fundamental. The main purposes are data augmentation [124], data imputation (i.e., the filling of missing values) [41], [42], and rebalancing [43], [44], [125], [126]. Another highly relevant topic is privacy-aware machine learning [45], [46], [127], where generated data can potentially be leveraged to overcome privacy concerns.

A. Methods

While the generation of images and text is highly explored [128]–[130], generating synthetic tabular data is still a challenge. The mixed structure of discrete and continuous features, along with their different value distributions, poses a significant challenge.

Classical approaches for the data generation task include Copulas [131], [132] and Bayesian Networks [133]. Among Bayesian Networks, those based on the Chow-Liu approximation [134] are especially popular.

In the deep-learning era, Generative Adversarial Networks (GANs) [135] have proven highly successful for the generation of images [128], [136]. GANs were recently introduced as an original way to train a generative deep neural network model. They consist of two separate models: a generator G that generates samples from the data distribution, and a discriminator D that estimates the probability that a sample came from the ground truth distribution. Both G and D are usually chosen to be non-linear functions such as multilayer perceptrons. To learn a generator distribution p_g over data x, the generator G(z; θ_g) maps samples from a noise distribution p_z(z) (e.g., the Gaussian distribution) to the input data space. The discriminator D(x; θ_d) outputs the probability that a data point x comes from the training data's distribution p_data rather than from the generator's output distribution p_g. During joint training of G and D, G will start generating successively more realistic samples to fool the discriminator D. For more details on GANs, we refer the interested reader to the original paper [135].
Although it was found that GANs lag behind at the genera- the normalization proposed by [144]. In their experiments, the
tion of discrete outputs such as natural language [130], they are Wasserstein GAN loss or the use of convolutional architectures
still frequently chosen to generate tabular data. Vanilla GANs or on tabular data does boost the generative performance.
derivates such as the Wasserstein GAN (WGAN) [137], WGAN
with gradient penalty (WGAN-GP) [138], Cramér GAN [139],
or the Boundary seeking GAN [140], which is designed to B. Assessing Generative Quality
model discrete data, are commonly used in the literature to To assess the quality of the generated data, several per-
generate tabular data. Moreover, VeeGAN [141] is frequently formance measures are used. The most common approach
used for tabular data. Apart from GANs, autoencoder-based is to define a proxy classification task and train one model
architectures – in particular those relying on Variational for it on the real training set and another on the artificially
Autoencoders (VAEs) [142] – have been proposed [143], [144]. generated data set. With a highly capable generator, the
In Table III, we provide an overview of tabular generation predictive performance of the artificial-data model on the
approaches, that use deep learning techniques. Note that due to real-data test set should be almost on par with its real-data
Method                 | Based upon                      | Application
medGAN [46]            | Autoencoder+GAN                 | Medical Records
TableGAN [145]         | DCGAN                           | General
Mottini et al. [149]   | Cramér GAN                      | Passenger Records
Camino et al. [150]    | medGAN, ARAE                    | General
medBGAN, medWGAN [151] | WGAN-GP, Boundary seeking GAN   | Medical Records
ITS-GAN [124]          | GAN with AE for constraints     | General
CTGAN, TVAE [144]      | Wasserstein GAN, VAE            | General
actGAN [126]           | WGAN-GP                         | Health Data
VAEM [143]             | VAE (Hierarchical)              | General
OVAE [152]             | Oblivious VAE                   | General
TAEI [44]              | AE+SMOTE (in multiple setups)   | General
Causal-TGAN [153]      | Causal-Model, WGAN-GP           | General
Copula-Flow [45]       | Invertible Flows                | General

TABLE III: Generation of tabular data using deep neural network models (in chronological order).
counterpart. This measure is often referred to as machine learning efficacy and used in [46], [144], [149]. In non-obvious classification tasks, an arbitrary feature can be used as a label and predicted [46], [150], [151]. Another approach is to visually inspect the modelled distributions per-feature, e.g., the cumulative distribution functions [124], or compare the expected values in scatter plots [46], [150]. A more quantitative approach is the use of statistical tests, such as the Kolmogorov-Smirnov test [154], to assess the distributional difference [151]. On synthetic data sets, the output distribution can be compared to the ground truth, e.g., in terms of log-likelihood [144], [147]. Because overfitted models can also obtain good scores, [144] propose evaluating the likelihood of a test set under an estimate of the GAN's output distribution. Especially in a privacy-preserving context, the distribution of the Distance to Closest Record (DCR) can be calculated and compared to the respective distances on the test set [145]. This measure is important to assess the extent of sample memorization. Overall, we conclude that a single measure is not sufficient to assess the generative quality. For instance, a generative model that memorizes the original samples will score well in the machine learning efficacy metric but fail the DCR check. Therefore, we highly recommend using several evaluation measures that focus on individual aspects of data quality.
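The two complementary checks discussed above, machine learning efficacy and the distance to the closest record, can be sketched as follows. The random forest, the numeric arrays with the label in the last column, and the Euclidean distance are illustrative assumptions:

```python
# Sketch of the machine-learning-efficacy check and the DCR statistic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def ml_efficacy(real_train, synthetic, real_test):
    """Train once on real rows and once on synthetic rows, evaluate both on real test rows."""
    scores = {}
    for name, data in [("real", real_train), ("synthetic", synthetic)]:
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(data[:, :-1], data[:, -1])
        scores[name] = accuracy_score(real_test[:, -1], clf.predict(real_test[:, :-1]))
    return scores  # a small gap between the two scores indicates a capable generator

def distance_to_closest_record(synthetic, reference):
    """DCR: for every synthetic row, the distance to its nearest reference row."""
    diffs = synthetic[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).min(axis=1)  # near-zero values hint at memorization
```

A generator that merely copies its training rows would look good under ml_efficacy but produce a DCR distribution concentrated at zero, which is exactly why several measures should be combined.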
VI. EXPLANATION MECHANISMS FOR DEEP LEARNING WITH TABULAR DATA

Explainable machine learning is concerned with the problem of providing explanations for complex machine learning models. With stricter regulations for automated decision making [48] and the adoption of machine learning models in high-stakes domains such as finance and healthcare [52], interpretability is becoming a key concern. Towards this goal, various streams of research follow different explainability paradigms. Among these, feature attribution methods and counterfactual explanations are two of the popular forms [155]–[157]. Because these techniques are gaining importance for researchers and practitioners alike, we dedicate the following section to reviewing these methods.

A. Feature Highlighting Explanations
Local input attribution techniques seek to explain the behaviour of machine learning models instance by instance. Those methods aim to highlight the influence the inputs have on the prediction by assigning importance scores to the input features. Some popular approaches for model explanations aim at constructing classification models that are explainable by design [158]–[160]. This is often achieved by enforcing the deep neural network model to be locally linear. Moreover, if the model's parameters are known and can be accessed, then the explanation technique can use these parameters to generate the model explanation. For such settings, relevance-propagation-based methods, e.g., [161], [162], and gradient-based approaches, e.g., [163]–[165], have been suggested. In cases where the parameters of the neural network cannot be accessed, model-agnostic approaches can prove useful. This group of approaches seeks to explain a model's behavior locally by applying surrogate models [123], [166]–[169], which are interpretable by design and are used to explain individual predictions of black-box machine learning models. In order to test the performance of these black-box explanation techniques, Liu et al. [170] suggest a python-based benchmarking library.
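As a concrete illustration of the model-agnostic, surrogate-based route described above, the following minimal sketch queries KernelSHAP [123] for per-feature attributions; the classifier, the synthetic data, and the background-sample size are placeholders:

```python
# Minimal model-agnostic explanation with KernelSHAP on a tabular classifier.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

X = np.random.rand(500, 6)                    # assumed tabular features
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)     # synthetic target for the sketch
model = GradientBoostingClassifier().fit(X, y)

background = shap.sample(X, 50)               # background data for the surrogate
explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = explainer.shap_values(X[:5])    # per-class, per-feature attributions for 5 rows
```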
B. Counterfactual Explanations
From the perspective of algorithmic recourse, the main purpose of counterfactual explanations is to suggest constructive interventions to the input of a deep neural network so that the output changes to the advantage of an end user. In simple terms, a minimal change to the feature vector that will flip the classification outcome is computed and provided as an explanation. By emphasizing both the feature importance and
the recommendation aspect, counterfactual explanation methods can be further divided into three different groups: works that assume that all features can be independently manipulated [171] and works that focus on manifold constraints to capture feature dependencies.

In the class of independence-based methods, where the input features of the predictive model are assumed to be independent, some approaches use combinatorial solvers to generate recourse in the presence of feasibility constraints [172]–[175]. Another line of research deploys gradient-based optimization to find low-cost counterfactual explanations in the presence of feasibility and diversity constraints [176]–[178]. The main problem with these approaches is that they abstract from input correlations. To alleviate this problem and to suggest realistic looking counterfactuals, researchers have suggested building recourse suggestions on generative models [179]–[184]. The main idea is to change the geometry of the intervention space to a lower dimensional latent space, which encodes different factors of variation while capturing input dependencies. To this end, these methods primarily use (tabular data) variational autoencoders [142], [185]. In particular, Mahajan et al. [182] demonstrate how to encode various feasibility constraints into such models. However, an extensive comparison across this class of methods is still missing since it is difficult to measure how realistic the generated data are in the context of algorithmic recourse.

More recently, a few works have suggested to develop counterfactual explanations that are robust to model shifts and noise in the recourse implementations [186]–[188]. A comprehensive treatment on how to extend these lines of work to arbitrary high cardinality categorical variables is still an open problem in the field.
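The gradient-based flavour of counterfactual search mentioned above can be sketched as a small optimization problem: find a sparse perturbation that flips a differentiable classifier's prediction. The model, the L1 trade-off lambda, and the step count below are illustrative assumptions rather than any specific published method:

```python
# Sketch of a gradient-based counterfactual search for a differentiable classifier.
import torch

def counterfactual(model, x, target_class, lam=0.1, steps=500, lr=0.05):
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        logits = model((x + delta).unsqueeze(0))
        # classification loss pushes the output towards the desired class,
        # the L1 penalty keeps the suggested change small and sparse
        loss = torch.nn.functional.cross_entropy(logits, target) + lam * delta.abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return (x + delta).detach()

# Placeholder classifier and instance to be explained.
model = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
x = torch.randn(5)
x_cf = counterfactual(model, x, target_class=1)
```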
For a more fine-grained overview of the literature on counterfactual explanations, we refer the interested reader to the most recent surveys [189], [190]. Finally, Pawelczyk et al. [157] have implemented an open-source python library which provides support for many of the aforementioned counterfactual explanation models.

VII. EXPERIMENTS

Although several experimental studies have been published in recent years [7], [10], an exhaustive comparison between existing deep learning approaches for heterogeneous tabular data is still missing in the literature. For example, important aspects of deep learning models such as training and inference time, model size, and interpretability are not discussed.

To fill this gap, we present an extensive empirical comparison of machine and deep learning methods on real-world data sets with varying characteristics in this section. We discuss the data set choice (VII-A), the results (VII-B), and present a comparison of the training and inference time for all the machine learning models considered in this survey (VII-C). We also discuss the size of deep learning models. Lastly, to the best of our knowledge, we present the first comparison of explainable deep learning methods for tabular data (VII-D). We release the full source code of our experiments for maximum transparency¹.

¹ Open benchmarking on tabular data for machine learning models: https://github.com/kathrinse/TabSurvey.

A. Data Sets
In computer vision, there are many established data sets for the evaluation of new deep learning architectures such as MNIST [105], CIFAR [191], and ImageNet [192]. On the contrary, there are no established standard heterogeneous data sets. Carefully checking the works listed in Section IV, we identified over 100 different data sets with different characteristics in their respective experimental evaluation sections. We note that the small overlap between the mentioned works makes it hard to compare the results across these works in general. Therefore, in this work, we deliberately select data sets covering the entire range of characteristics, such as data domain (e.g., finance, e-commerce, geography, physics), different types of target variables (classification, regression), varying number of categorical variables and continuous variables, and differing sample sizes (small to large). Furthermore, most of the selected data sets were previously featured in multiple studies.

The first data set of our study is the Home Equity Line of Credit (HELOC) data set provided by FICO [193]. This data set consists of anonymized information from real homeowners who applied for home equity lines of credit. A HELOC is a line of credit typically offered by a bank as a percentage of home equity. The task consists of using the information about the applicant in their credit report to predict whether they will repay their HELOC account within a two-year period.

We further use the Adult Income data set [61], which is among the most popular tabular data sets used in the surveyed work (5 usages). It includes basic information about individuals such as: age, gender, education, etc. The target variable is binary; it represents high and low income.

The largest tabular data set in our study is HIGGS, which stems from particle physics. The task is to distinguish between signals with Higgs bosons (HIGGS) and a background process [194]. Monte Carlo simulations [195] were used to produce the data. In the first 21 columns (columns 2-22), the particle detectors in the accelerator measure kinematic properties. In the last seven columns, these properties are analyzed. In total, HIGGS includes eleven million rows. In contrast to other data sets of our study, the HIGGS data set contains only numerical or continuous variables. Since DeepFM, DeepGBM, and TabTransformer models require at least one categorical attribute, we discretize the twenty-first variable into a categorical variable with three groups.

The Covertype data set [61] is a multi-class classification data set which holds cartographic information about land cells (e.g., elevation, slope). The goal is to predict which one out of seven forest cover types is present in the cell.

Finally, we utilize the California Housing data set [196], which contains information about a number of properties. The prediction task (regression) is to estimate the price of the corresponding home.

The fundamental characteristics of the selected data sets are summarized in Table IV.
              | HELOC  | Adult Income | HIGGS | Covertype   | California Housing
Samples       | 9.871  | 32.561       | 11 M. | 581.012     | 20.640
Num. features | 21     | 6            | 27    | 52          | 8
Cat. features | 2      | 8            | 1     | 2           | 0
Task          | Binary | Binary       | Binary| Multi-Class | Regression
Classes       | 2      | 2            | 2     | 7           | -

TABLE IV: Main properties of the real-world heterogeneous tabular data sets used in this survey. We also indicate the data set task, where "Binary" stands for binary classification, and "Multi-class" represents multi-class classification.
B. Open Performance Benchmark on Tabular Data
1) Hyperparameter Selection: In order to do a fair evaluation, we use the Optuna library [203] with 100 iterations for each model to tune hyperparameters. Each hyperparameter configuration was cross-validated with five folds. The hyperparameter ranges used are publicly available online along with our code. We laid out the search space based on the information given in the corresponding papers and recommendations from the framework's authors.
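The search protocol can be illustrated with a short Optuna sketch: 100 trials, each configuration scored by 5-fold cross-validation. The gradient boosting classifier, its search ranges, and the synthetic data below are placeholders; the actual spaces for each of the benchmarked models are documented in our repository:

```python
# Sketch of the tuning protocol: Optuna [203] with 100 trials and 5-fold CV per trial.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
    }
    model = GradientBoostingClassifier(**params)
    return cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```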
2) Data Preprocessing: We preprocessed the data in the same way for every machine learning model by applying zero-mean, unit-variance normalization to the numerical features and an ordinal encoding to the categorical ones. The missing values were substituted with zeros for the linear regression and models based on pure neural networks since these methods cannot accept them otherwise. We apply the ordinal encoding to categorical values for all models. According to the work [54], the chosen encoding strategy shows comparable performance to more advanced methods. We explicitly specify which features are categorical for LightGBM, DeepFM, DeepGBM, TabNet, TabTransformer, and SAINT, since these approaches provide special functionality dedicated to categorical values, e.g., learning an embedding of the categories.
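The preprocessing steps described above (zero-imputation, standardization of numerical columns, ordinal encoding of categorical columns) can be sketched as follows; the column names are placeholders for an arbitrary heterogeneous data set:

```python
# Sketch of the shared preprocessing pipeline used for every model.
import pandas as pd
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

def preprocess(df, num_cols, cat_cols):
    df = df.copy()
    df[num_cols] = df[num_cols].fillna(0.0)                       # zero-substitution of missing values
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])   # zero-mean, unit-variance scaling
    df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols].astype(str))
    return df

df = pd.DataFrame({"age": [39, 50, None], "workclass": ["Private", "Self-emp", "Private"]})
prepared = preprocess(df, num_cols=["age"], cat_cols=["workclass"])
```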
3) Reproducibility and Extensibility: For maximum reproducibility, we run all experiments in a docker container [204]. We underline again that our full code is publicly released so that the experiments can be replicated. The mentioned datasets are also publicly available and can be used as a benchmark for novel methods. We would highly welcome contributed implementations of additional methods from the data science community.
4) Results: The results of our experiments are shown in Table V. They draw a different picture than many recent research papers may suggest: For all but the very large HIGGS data set, the best scores are still obtained by boosted decision tree ensembles. XGBoost and CatBoost outperform all deep learning-based approaches on the small and medium data sets, the regression data set, and the multi-class data set. For the large-scale HIGGS, SAINT outperforms the classical machine learning approaches. This suggests that for very large tabular data sets with predominantly continuous features, modern neural network architectures may have an advantage over classical approaches after all. In general, however, our results are consistent with the inferior performance of deep learning techniques in comparison to approaches based on decision tree ensembles (such as gradient boosting decision trees) on tabular data that was observed in various Kaggle competitions [205].

Considering only deep learning approaches, we observe that SAINT provided competitive results across data sets. However, for the other models, the performance was highly dependent on the chosen data set. DeepFM performed best (among the deep learning models) on the Adult data set and second-best on the California Housing data set, but returned only weak results on HELOC.

C. Run Time Comparison
We also analyse the training and inference time of the models in comparison to their performance. We plot the time-performance characteristic for the models in Fig. 3 and Fig. 4 for the Adult and the HIGGS data set respectively. While the training time of gradient-boosting-based models is lower than that of most deep neural network-based methods, their inference time on the HIGGS data set with 11 million samples is significantly higher: for XGBoost, the inference time amounts to 5995 seconds whereas inference times for MLP and SAINT are 10.18 and 282 seconds respectively. All gradient-boosting and deep learning models were trained on the same GPU.

D. Interpretability Assessment
As opposed to the pure on-task performance, interpretability of the models is becoming an increasingly important characteristic. Therefore, we end this section with a distinct assessment of the interpretability properties claimed by some methods. The model size (number of parameters) can provide a first intuition of the interpretability of the models. Therefore, we provide a size comparison of deep learning models in Fig. 5.

Admittedly, explanations can be provided in very different forms, which may each have their own use-cases. Hence, we can only compare explanations that have a common form. In this work, we chose feature attributions as the explanation format because they are the prevalent form of post-hoc explainability for the models considered in this work. Remarkably, the models that build on the transformer architecture (Section IV-B2) often claim some extent of interpretability through the attention maps [9]. To verify this hypothesis and assess the attribution provided by some of the frameworks in practice, we run an ablation test with the features that were attributed the highest importance over all samples. Furthermore, due to the lack of ground truth attribution values, we compare individual attributions to the well-known KernelSHAP values [123].

Evaluation of the quality of feature attribution is known to be a non-trivial problem [206]. We measure the fidelity [207] of the attributions by successively removing the features that have the highest mean importance assigned (Most Relevant First, MoRF [207]). We then retrain the model on the reduced feature set. A sharp drop in predictive accuracy indicates that the discriminative features were successfully identified and removed. We do the same for the inverse order, Least Relevant First (LeRF), which removes the features deemed unimportant. In this case, the accuracy should stay high as long as possible. For the attention maps of TabTransformer and SAINT, we either use the sum over the entire columns of the intra-feature attention maps as an importance estimate or only take the diagonal (feature self-attentions) as attributions.
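The MoRF/LeRF fidelity test described above amounts to a simple retraining loop. The following sketch uses a random forest and a placeholder attribution vector purely for illustration; in the benchmark, the attributions come from the attention-based models and the corresponding model is retrained in each round:

```python
# Sketch of the MoRF/LeRF ablation curves used for the fidelity assessment.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def ablation_curve(X, y, importances, most_relevant_first=True):
    order = np.argsort(importances)[::-1] if most_relevant_first else np.argsort(importances)
    remaining = list(range(X.shape[1]))
    accuracies = []
    for feature in order:
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, remaining], y, random_state=0)
        clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))   # evaluate before removing the next feature
        remaining.remove(feature)                  # drop the feature, retrain in the next round
    return accuracies  # MoRF: a fast drop is good; LeRF: accuracy should stay high
```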
                     HELOC            Adult            HIGGS            Covertype        Cal. Housing
                     Acc ↑   AUC ↑    Acc ↑   AUC ↑    Acc ↑   AUC ↑    Acc ↑   AUC ↑    MSE ↓
Linear Model 73.0±0.0 80.1±0.1 82.5±0.2 85.4±0.2 64.1±0.0 68.4±0.0 72.4±0.0 92.8±0.0 0.528±0.008
KNN [65] 72.2±0.0 79.0±0.1 83.2±0.2 87.5±0.2 62.3±0.1 67.1±0.0 70.2±0.1 90.1±0.2 0.421±0.009
Decision Tree [197] 80.3±0.0 89.3±0.1 85.3±0.2 89.8±0.1 71.3±0.0 78.7±0.0 79.1±0.0 95.0±0.0 0.404±0.007
Random Forest [198] 82.1±0.2 90.0±0.2 86.1±0.2 91.7±0.2 71.9±0.0 79.7±0.0 78.1±0.1 96.1±0.0 0.272±0.006
XGBoost [53] 83.5±0.2 92.2±0.0 87.3±0.2 92.8±0.1 77.6±0.0 85.9±0.0 97.3±0.0 99.9±0.0 0.206±0.005
LightGBM [78] 83.5±0.1 92.3±0.0 87.4±0.2 92.9±0.1 77.1±0.0 85.5±0.0 93.5±0.0 99.7±0.0 0.195±0.005
CatBoost [79] 83.6±0.3 92.4±0.1 87.2±0.2 92.8±0.1 77.5±0.0 85.8±0.0 96.4±0.0 99.8±0.0 0.196±0.004
Model Trees [199] 82.6±0.2 91.5±0.0 85.0±0.2 90.4±0.1 69.8±0.0 76.7±0.0 - - 0.385±0.019

MLP [200] 73.2±0.3 80.3±0.1 84.8±0.1 90.3±0.2 77.1±0.0 85.6±0.0 91.0±0.4 76.1±3.0 0.263±0.008
DeepFM [15] 73.6±0.2 80.4±0.1 86.1±0.2 91.7±0.1 76.9±0.0 83.4±0.0 - - 0.260±0.006
DeepGBM [70] 78.0±0.4 84.1±0.1 84.6±0.3 90.8±0.1 74.5±0.0 83.0±0.0 - - 0.856±0.065
RLN [72] 73.2±0.4 80.1±0.4 81.0±1.6 75.9±8.2 71.8±0.2 79.4±0.2 77.2±1.5 92.0±0.9 0.348±0.013
TabNet [5] 81.0±0.1 90.0±0.1 85.4±0.2 91.1±0.1 76.5±1.3 84.9±1.4 93.1±0.2 99.4±0.0 0.346±0.007
VIME [88] 72.7±0.0 79.2±0.0 84.8±0.2 90.5±0.2 76.9±0.2 85.5±0.1 90.9±0.1 82.9±0.7 0.275±0.007
TabTransformer [98] 73.3±0.1 80.1±0.2 85.2±0.2 90.6±0.2 73.8±0.0 81.9±0.0 76.5±0.3 72.9±2.3 0.451±0.014
NODE [6] 79.8±0.2 87.5±0.2 85.6±0.3 91.1±0.2 76.9±0.1 85.4±0.1 89.9±0.1 98.7±0.0 0.276±0.005
Net-DNF [57] 82.6±0.4 91.5±0.2 85.7±0.2 91.3±0.1 76.6±0.1 85.1±0.1 94.2±0.1 99.1±0.0 -
STG [201] 73.1±0.1 80.0±0.1 85.4±0.1 90.9±0.1 73.9±0.1 81.9±0.1 81.8±0.3 96.2±0.0 0.285±0.006
NAM [202] 73.3±0.1 80.7±0.3 83.4±0.1 86.6±0.1 53.9±0.6 55.0±1.2 - - 0.725±0.022
SAINT [9] 82.1±0.3 90.7±0.2 86.1±0.3 91.6±0.2 79.8±0.0 88.3±0.0 96.3±0.1 99.8±0.0 0.226±0.004

TABLE V: Open performance benchmark results based on (stratified) 5-fold cross-validation. We use the same fold splitting
strategy for every data set. The top results for each dataset are in bold, we also underline the second-best results. The mean
and standard deviation values are reported for each baseline model. Missing results indicate that the corresponding model could
not be applied to the task type (regression or multi-class classification).


Fig. 3: Train (left) and inference (right) time benchmarks for selected methods on the Adult data set with 32.561 samples. The
circle size reflects the accuracy standard deviation.

The obtained curves are visualized in Fig. 6. For the MoRF order, TabNet and TabTransformer with the diagonal of the attention head as attributions seem to perform best. For LeRF, TabNet is the only significantly better method than the others. For TabTransformer, taking the diagonal of the attention matrix seems to increase the performance, whereas for SAINT, there is almost no difference. We additionally compare the attribution values obtained to values from the KernelSHAP attribution method. Unfortunately, there are no ground truth attributions to compare with. However, the SHAP framework has a solid grounding in game theory and is widely deployed [50]. We only compare the absolute values of the attributions, as the attention maps are constrained to be positive. As a measure of agreement, we compute the Spearman Rank Correlation between the attributions by the SHAP framework and the tabular data models. The correlation we observe is surprisingly low across all models, and sometimes it is even negative, which means that a higher SHAP attribution will probably result in a lower attribution by the model.

In these two simple benchmarks, the transformer models were not able to produce convincing feature attributions out-of-the-box. We come to the conclusion that more profound benchmarks of the claimed interpretability characteristics and their usefulness in practice are necessary.
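The per-sample agreement measure reported in Table VI can be sketched as follows; both attribution matrices below are random placeholders with the shape (n_samples, n_features) that the real SHAP and attention-based attributions would have:

```python
# Sketch of the rank-correlation agreement between SHAP and model-provided attributions.
import numpy as np
from scipy.stats import spearmanr

def attribution_agreement(shap_attr, model_attr):
    correlations = [
        spearmanr(np.abs(s), np.abs(m))[0]     # absolute values: attention maps are positive
        for s, m in zip(shap_attr, model_attr)
    ]
    return np.mean(correlations), np.std(correlations) / np.sqrt(len(correlations))

rng = np.random.default_rng(0)
shap_attr, model_attr = rng.random((750, 14)), rng.random((750, 14))
mean_corr, std_err = attribution_agreement(shap_attr, model_attr)
```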
Fig. 4: Train (left) and inference (right) time benchmarks for selected methods on the HIGGS data set with 11 million samples.
The circle size reflects the accuracy standard deviation.

Fig. 5: A size comparison of deep learning models on the Adult data set. The circle size reflects standard deviation.

Model, attention used              | Spearman Corr.
TabTransformer, columnw. attention | -0.01 ± 0.008
TabTransformer, diag. attention    |  0.00 ± 0.010
TabNet                             |  0.07 ± 0.009
SAINT, columnw. attention          | -0.04 ± 0.007
SAINT, diag. attention             |  0.01 ± 0.007

TABLE VI: Spearman rank correlation of the provided attribution with KernelSHAP values as ground truth. Results were computed on 750 random samples from the Adult data set.

(a) Most Relevant First (MoRF)   (b) Least Relevant First (LeRF)
Fig. 6: Resulting curves of the global attribution benchmark for feature attributions (15 runs on Adult). Standard errors are indicated by the shaded area. For the MoRF order, an early drop in accuracy is desirable, while for LeRF, the accuracy should stay as high as possible.

VIII. DISCUSSION AND FUTURE PROSPECTS
In this section, we summarize our findings and discuss current and future trends in deep learning approaches for tabular data (Section VIII-A). Moreover, we identify several open research questions that could be tackled to advance the field of tabular deep neural networks (Section VIII-B).

A. Summary and Trends
Decision Tree Ensembles are still State-of-the-Art. In a fair comparison on multiple data sets, we demonstrated that models based on tree-ensembles, such as XGBoost, LightGBM, and CatBoost, still outperform the deep learning models on most data sets that we considered and come with the additional advantage of significantly less training time. Even though it has been six years since the XGBoost publication [53] and over twenty years since the publication of the original gradient boosting paper [102], we can state that despite much research effort in deep learning, the state of the art for tabular data remains largely unchanged. However, we observed that for very large data sets, approaches based on deep learning may still be able to achieve competitive performance and even outperform classical models. In summary, we think that a fundamental reorientation of the domain may be necessary. For now, the question of whether the use of current deep learning techniques is beneficial for tabular
data can generally be answered in the negative. This applies in particular to small heterogeneous data sets that are common in applications. Hence, instead of proposing more and more complex models, we argue that a more profound understanding of the reasons for this performance gap is needed.

Unified Benchmarking. Furthermore, our results highlight the need for unified benchmarks. There is no consensus in the machine learning community on how to make a fair and efficient comparison. Shwartz-Ziv & Armon [7] show that the choice of benchmarking data sets can have a non-negligible impact on the performance assessment. While we chose common data sets with varying characteristics for our experiments, a different choice of data sets or hyperparameters, such as the encoding used (e.g., one-hot encoding), may lead to a different outcome. Because of the excessive number of data sets (in the eighteen works listed in Table II, over 100 different data sets are used), there is a necessity for a standardized benchmarking procedure, which allows to identify significant progress with respect to the state of the art. With this work, we also propose an open-source benchmark for deep learning models on tabular data. For tabular data generation tasks, Xu et al. [144] propose a sound evaluation framework with artificial and real-world data sets (Sec. V-B), but researchers need to agree on common benchmarks in this subdomain as well.

Tabular Data Preprocessing. Many of the challenges for deep neural networks on tabular data are related to the heterogeneity of the data (e.g., categorical and sparse values). Therefore, some deep learning solutions transform them into a homogeneous representation more suitable to neural networks. While the additional overhead is small, such transforms can boost performance considerably and should thus be among the first strategies applied in real-world scenarios.

Architectures for Deep Learning on Tabular Data. Architecture-wise, there has been a clear trend towards transformer-based solutions (Section IV-B2) in recent years. These approaches offer multiple advantages over standard neural network architectures, for instance, learning with attention over both categorical and numerical features. Moreover, self-supervised or unsupervised pre-training that leverages unlabelled tabular data to train parts of the deep learning model is gaining popularity, not only among transformer-based approaches. Performance-wise, multiple independent evaluations demonstrate that deep neural network methods from the hybrid (Sec. IV-B1) and transformer-based (Sec. IV-B2) groups exhibit superior predictive performance compared to plain deep neural networks on various data sets [9], [55], [70], [93]. This underlines the importance of special-purpose architectures for tabular data.

Regularization Models for Tabular Data. It has also been shown that regularization reduces the hypersensitivity of deep neural network models and improves the overall performance [10], [72]. We believe that regularization is one of the crucial aspects for a more robust and accurate performance of deep neural networks on tabular data and is gaining momentum.

Deep Generative Models for Tabular Data. Powerful tabular data generation is essential for the development of high-quality models, particularly in a privacy context. With suitable data generators at hand, developers can use large, synthetic, and yet realistic data sets to develop better models, while not being subject to privacy concerns [148]. Unfortunately, the generation task is as hard as inference in predictive models, so progress in both areas will likely go hand in hand.

Interpretable Deep Learning Models for Tabular Data. Interpretability is undoubtedly desirable, particularly for tabular data models frequently applied to personal data, e.g., in healthcare and finance. An increasing number of approaches offer it out-of-the-box, but most current deep neural network models are still mainly concerned with the optimization of a chosen error metric. Therefore, extending existing open-source libraries (see [157], [170]) aimed at interpreting black-box models helps advance the field. Moreover, interpretable deep tabular learning is essential for understanding model decisions and results, especially for life-critical applications. However, much of the state-of-the-art recourse literature does not offer easy support of heterogeneous tabular data and lacks metrics to evaluate the quality of heterogeneous data recourse. Finally, model explanations can also be used to identify and mitigate potential bias or eliminate unfair discrimination against certain groups [208].

Learning From Evolving Data Streams. Many modern applications are subject to continuously evolving data streams, e.g., social media, online retail, or healthcare. Streaming data are usually heterogeneous and potentially unlimited. Therefore, observations must be processed in a single pass and cannot be stored. Indeed, online learning models can only access a fraction of the data at each time step. Furthermore, they have to deal with limited resources and shifting data distributions (i.e., concept drift). Hence, hyperparameter optimization and model selection, as typically involved in deep learning, are usually not feasible in a data stream. For this reason, despite the success of deep learning in other domains, less complex methods such as incremental decision trees [209], [210] are often preferred in online learning applications.

B. Open Research Questions
Several open problems need to be addressed in future research. In this section, we will list those we deem fundamental to the domain.

Information-theoretic Analysis of Encodings. Encoding methods are highly popular when dealing with tabular data. However, the majority of data preprocessing approaches for deep neural networks are lossy in terms of information content. Therefore, it is challenging to achieve an efficient, almost lossless transformation of heterogeneous tabular data into homogeneous data. Nevertheless, the information-theoretic view on these transformations remains to be investigated in detail and could shed light on the underlying mechanisms.

Computational Efficiency in Hybrid Models. The work by Shwartz-Ziv & Armon [7] suggests that the combination of a gradient boosting decision tree and deep neural networks may improve the predictive performance of a machine learning system. However, it also leads to growing complexity. Training or inference times, which far exceed those of classical machine learning approaches, are a recurring problem when developing
hybrid models. We conclude that the integration of state-of-the-art approaches from classical machine learning and deep learning has not been conclusively resolved yet and future work should be conducted on how to mitigate the trade-off between predictive performance and computational complexity.

Specialized Regularizations. We applaud recent research on regularization methods, in which we see a promising direction that necessitates further exploration. Whether context- and architecture-specific regularizations for tabular data can be found remains an open question. However, a recent work [211] indicates that regularization techniques for deep neural networks such as weight decay and data augmentation produce an unfair model across classes. Additionally, it is relevant to explore the theoretical constraints that govern the success of regularization on tabular data more profoundly.

Novel Processes for Tabular Data Generation. For tabular data generation, modified Generative Adversarial Networks and Variational Autoencoders are prevalent. However, the modelling of dependencies and categorical distributions remains the key challenge. Novel architectures in this area, such as diffusion models, have not been adapted to the domain of tabular data. Furthermore, the definition of an entirely new generative process particularly focused on tabular data might be worth investigating.

Interpretability. Going forward, counterfactual explanations for deep tabular learning can be used to improve the perceived fairness in human-AI interaction scenarios and to enable personalized decision-making [190]. However, the heterogeneity of tabular data poses problems for counterfactual explanation methods to be reliably deployed in practice. Devising techniques aimed at effectively handling heterogeneous tabular data in the presence of feasibility constraints is still an unsolved task [157].

Transfer of Deep Learning Methods to Data Streams. Recent work shows that some of the limitations of neural networks in an evolving data stream can be overcome [25], [212]. Conversely, changes in the parameters of a neural network may be effectively used to weigh the importance of input features over time [213] or to detect concept drift [214]. Accordingly, we argue that deep learning for streaming data – in particular strategies for dealing with evolving and heterogeneous tabular data – should receive more attention in the future.

Transfer Learning for Tabular Data. Reusing knowledge gained solving one problem and applying it to a different task is the research problem addressed by transfer learning. While transfer learning is successfully used in computer vision and natural language processing applications [215], there are no efficient and generally accepted ways to do transfer learning for tabular data. Hence, a general research question can be how to share knowledge between multiple (related) tabular data sets efficiently.

Data Augmentation for Tabular Data. Data augmentation has proven highly effective to prevent overfitting, especially in computer vision [216]. While some data augmentation techniques for tabular data exist, e.g., SMOTE-NC [217], simple models fail to capture the dependency structure of the data. Therefore, generating additional samples in a continuous latent space is a promising direction. This was investigated by Darabi & Elor [44] for minority oversampling. Nevertheless, the reported improvements are only marginal. Thus, future work is required to find simple, yet effective random transformations to enhance tabular training sets.

Self-supervised Learning. Large-scale labelled data are usually required to train deep neural networks; however, the data labelling is an expensive task. To avoid this expensive step, self-supervised methods propose to learn general feature representations from available unlabelled data. These methods have also shown astonishing results in computer vision and natural language processing [218], [219]. Only a few recent works in this direction [88], [89], [220] deal with heterogeneous data. Hence, novel self-supervised learning approaches dedicated to tabular data might be worth investigating.

IX. CONCLUSION

This survey is the first work to systematically explore deep neural network approaches for heterogeneous tabular data. In this context, we highlighted the main challenges and research advances in modelling, generating, and explaining tabular data. We introduced a unified taxonomy that categorizes deep learning approaches for tabular data into three branches: data transformation methods, specialized architectures, and regularization models. We believe our taxonomy will help catalogue future research and better understand and address the remaining challenges in applying deep learning to tabular data. We hope it will help researchers and practitioners to find the most appropriate strategies and methods for their applications.

Additionally, we also conducted an unbiased evaluation of the state-of-the-art deep learning approaches on multiple real-world data sets. Deep neural network-based methods for heterogeneous tabular data are still inferior to machine learning methods based on decision tree ensembles for small and medium-sized data sets (less than ∼1M samples). Only for a very large data set mainly consisting of continuous and numerical variables, the deep learning model SAINT outperformed these classical approaches. Furthermore, we assessed explanation properties of deep learning models with the self-attention mechanism. Although the TabNet model shows promising explanatory capabilities, inconsistencies between the explanations remain an open issue.

Due to the importance of tabular data to industry and academia, new ideas in this area are in high demand and can have a significant impact. With this review, we hope to provide interested readers with the references and insights they need to address open challenges and effectively advance the field.

REFERENCES
[1] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[3] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[5] S. O. Arik and T. Pfister, “TabNet: Attentive interpretable tabular distillation,” Advances in Neural Information Processing Systems,
learning,” arxiv:1908.07442, 2019. vol. 34, 2020.
[6] S. Popov, S. Morozov, and A. Babenko, “Neural oblivious decision [30] P. Gijsbers, E. LeDell, J. Thomas, S. Poirier, B. Bischl, and J. Van-
ensembles for deep learning on tabular data,” arxiv:1909.06312, 2019. schoren, “An open source AutoML benchmark,” arXiv preprint
[7] R. Shwartz-Ziv and A. Armon, “Tabular Data: Deep Learning is Not arXiv:1907.00909, 2019.
All You Need,” arXiv preprint arXiv:2106.03253, 2021. [31] P. Yin, G. Neubig, W.-t. Yih, and S. Riedel, “TaBERT: Pretraining
[8] S. Elsayed, D. Thyssens, A. Rashed, H. S. Jomaa, and L. Schmidt- for joint understanding of textual and tabular data,” arxiv:2005.08314,
Thieme, “Do we really need deep learning models for time series 2020.
forecasting?” arXiv preprint arXiv:2101.02118, 2021. [32] Z. Wang, Q. She, and T. E. Ward, “Generative adversarial net-
[9] G. Somepalli, M. Goldblum, A. Schwarzschild, C. B. Bruss, and works in computer vision: A survey and taxonomy,” arXiv preprint
T. Goldstein, “SAINT: Improved neural networks for tabular data via row arXiv:1906.01529, 2019.
attention and contrastive pre-training,” arXiv preprint arXiv:2106.01342, [33] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine
2021. learning: A survey and taxonomy,” IEEE transactions on pattern analysis
[10] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka, “Well-tuned Simple and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018.
Nets Excel on Tabular Datasets,” in Advances in Neural Information [34] D. Lichtenwalter, P. Burggräf, J. Wagner, and T. Weißer, “Deep
Processing Systems, 2021. multimodal learning for manufacturing problem solving,” Procedia
[11] D. Ulmer, L. Meijerink, and G. Cinà, “Trust issues: Uncertainty CIRP, vol. 99, pp. 615–620, 2021.
estimation does not enable reliable ood detection on medical tabular [35] S. Pölsterl, T. N. Wolf, and C. Wachinger, “Combining 3d image
data,” in Machine Learning for Health. PMLR, 2020, pp. 341–354. and tabular data via the dynamic affine feature map transform,” arXiv
[12] S. Somani, A. J. Russak, F. Richter, S. Zhao, A. Vaid, F. Chaudhry, J. K. preprint arXiv:2107.05990, 2021.
De Freitas, N. Naik, R. Miotto, G. N. Nadkarni et al., “Deep learning [36] D. d. B. Soares, F. Andrieux, B. Hell, J. Lenhardt, J. Badosa,
and the electrocardiogram: review of the current state-of-the-art,” EP S. Gavoille, S. Gaiffas, and E. Bacry, “Predicting the solar potential of
Europace, 2021. rooftops using image segmentation and structured data,” arXiv preprint
[13] V. Borisov, E. Kasneci, and G. Kasneci, “Robust cognitive load detection arXiv:2106.15268, 2021.
from wrist-band sensors,” Computers in Human Behavior Reports, vol. 4, [37] D. Medvedev and A. D’yakonov, “New properties of the data dis-
p. 100116, 2021. tillation method when working with tabular data,” arXiv preprint
[14] J. M. Clements, D. Xu, N. Yousefi, and D. Efimov, “Sequential deep arXiv:2010.09839, 2020.
learning for credit risk monitoring with tabular financial data,” arXiv [38] J. Li, Y. Li, X. Xiang, S.-T. Xia, S. Dong, and Y. Cai, “Tnt: An
preprint arXiv:2012.15330, 2020. interpretable tree-network-tree learning framework using knowledge
[15] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “Deepfm: a factorization- distillation,” Entropy, vol. 22, no. 11, p. 1203, 2020.
machine based neural network for ctr prediction,” arXiv preprint [39] D. Roschewitz, M.-A. Hartley, L. Corinzia, and M. Jaggi, “Ifedavg:
arXiv:1703.04247, 2017. Interpretable data-interoperability for federated learning,” arXiv preprint
[16] S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based recom- arXiv:2107.06580, 2021.
mender system: A survey and new perspectives,” ACM Computing [40] A. Sánchez-Morales, J.-L. Sancho-Gómez, J.-A. Martı́nez-Garcı́a, and
Surveys (CSUR), vol. 52, no. 1, pp. 1–38, 2019. A. R. Figueiras-Vidal, “Improving deep learning performance with
[17] M. Ahmed, H. Afzal, A. Majeed, and B. Khan, “A survey of evolution in missing values via deletion and compensation,” Neural Computing and
predictive models and impacting factors in customer churn,” Advances Applications, vol. 32, no. 17, pp. 13 233–13 244, 2020.
in Data Science and Adaptive Analysis, vol. 9, no. 03, p. 1750007, [41] L. Gondara and K. Wang, “Mida: Multiple imputation using denoising
2017. autoencoders,” in Pacific-Asia conference on knowledge discovery and
[18] Q. Tang, G. Xia, X. Zhang, and F. Long, “A customer churn prediction data mining. Springer, 2018, pp. 260–272.
model based on xgboost and mlp,” in 2020 International Conference [42] R. D. Camino, C. Hammerschmidt et al., “Working with deep generative
on Computer Engineering and Application (ICCEA). IEEE, 2020, pp. models and tabular data imputation,” ICML 2020 Artemiss Workshop,
608–612. 2020.
[19] A. L. Buczak and E. Guven, “A survey of data mining and machine [43] J. Engelmann and S. Lessmann, “Conditional wasserstein gan-based
learning methods for cyber security intrusion detection,” IEEE Commu- oversampling of tabular data for imbalanced learning,” Expert Systems
nications surveys & tutorials, vol. 18, no. 2, pp. 1153–1176, 2015. with Applications, vol. 174, p. 114582, 2021.
[20] F. Cartella, O. Anunciação, Y. Funabiki, D. Yamaguchi, T. Akishita, [44] S. Darabi and Y. Elor, “Synthesising multi-modal minority samples for
and O. Elshocht, “Adversarial attacks for tabular data: application to tabular data,” arXiv preprint arXiv:2105.08204, 2021.
fraud detection and imbalanced data,” CEUR Workshop Proceedings, [45] S. Kamthe, S. Assefa, and M. Deisenroth, “Copula flows for synthetic
vol. 2808, 2021. data generation,” arXiv preprint arXiv:2101.00598, 2021.
[21] B. Liu, M. Ding, S. Shaham, W. Rahayu, F. Farokhi, and Z. Lin, [46] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun,
“When machine learning meets privacy: A survey and outlook,” ACM “Generating Multi-label Discrete Patient Records using Generative
Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–36, 2021. Adversarial Networks,” Machine learning for healthcare conference,
[22] C. J. Urban and K. M. Gates, “Deep learning: A primer for psycholo- pp. 286–305, 2017.
gists.” Psychological Methods, 2021. [47] C. OAG, “Ccpa regulations: Final regulation text.” Office of the Attorney
[23] M. Shoman, A. Aboah, and Y. Adu-Gyamfi, “Deep learning framework General, California Department of Justice, 2021.
for predicting bus delays on multiple routes using heterogenous datasets,” [48] GDPR, “Regulation (eu) 2016/679 of the european parliament and of
Journal of Big Data Analytics in Transportation, vol. 2, no. 3, pp. 275– the council,” Official Journal of the European Union, 2016. [Online].
290, 2020. Available: http://www.privacyregulation.eu/en/13.htm
[24] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for [49] P. Voigt and A. Von dem Bussche, “The eu general data protection
anomaly detection: A review,” ACM Computing Surveys (CSUR), vol. 54, regulation (gdpr),” A Practical Guide, 1st Ed., Cham: Springer
no. 2, pp. 1–38, 2021. International Publishing, vol. 10, p. 3152676, 2017.
[25] D. Sahoo, Q. Pham, J. Lu, and S. C. Hoi, “Online deep learning: Learn- [50] M. Sahakyan, Z. Aung, and T. Rahwan, “Explainable artificial intelli-
ing deep neural networks on the fly,” arXiv preprint arXiv:1711.03705, gence for tabular data: A survey,” IEEE Access, 2021.
2017. [51] B. I. Grisci, M. J. Krause, and M. Dorn, “Relevance aggregation for
[26] X. He, K. Zhao, and X. Chu, “AutoML: A survey of the state-of-the-art,” neural networks interpretability and knowledge discovery on tabular
Knowledge-Based Systems, vol. 212, p. 106622, 2021. data,” Information Sciences, vol. 559, pp. 111–129, 2021.
[27] M. Artzi, E. Redmard, O. Tzemach, J. Zeltser, O. Gropper, J. Roth, [52] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh,
B. Shofty, D. A. Kozyrev, S. Constantini, and L. Ben-Sira, “Classification R. Puri, J. M. Moura, and P. Eckersley, “Explainable machine learning
of pediatric posterior fossa tumors using convolutional neural network in deployment,” in Proceedings of the 2020 conference on fairness,
and tabular data,” IEEE Access, vol. 9, pp. 91 966–91 973, 2021. accountability, and transparency, 2020, pp. 648–657.
[28] X. Shi, J. Mueller, N. Erickson, M. Li, and A. Smola, “Multimodal [53] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,”
AutoML on structured tables with text fields,” in 8th ICML Workshop in Proceedings of the 22nd ACM SIGKDD International Conference
on Automated Machine Learning (AutoML), 2021. on Knowledge Discovery and Data Mining (KDD), 2016, pp. 785–794.
[29] R. Fakoor, J. W. Mueller, N. Erickson, P. Chaudhari, and A. J. Smola, [54] J. T. Hancock and T. M. Khoshgoftaar, “Survey on categorical data for
“Fast, accurate, and simple models for tabular data via augmented neural networks,” Journal of Big Data, vol. 7, pp. 1–41, 2020.

[55] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, “Revisiting [81] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht,
deep learning models for tabular data,” arXiv preprint arXiv:2106.11959, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,”
2021. in International Conference on Machine Learning. PMLR, 2019, pp.
[56] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image 5301–5310.
recognition,” in Proceedings of the IEEE conference on computer vision [82] B. R. Mitchell et al., “The spatial inductive bias of deep learning,” Ph.D.
and pattern recognition, 2016, pp. 770–778. dissertation, Johns Hopkins University, 2017.
[57] L. Katzir, G. Elidan, and R. El-Yaniv, “Net-DNF: Effective deep [83] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT
modeling of tabular data,” in International Conference on Learning Press, 2016.
Representations, 2021. [84] Y. Gorishniy, I. Rubachev, and A. Babenko, “On embeddings for numer-
[58] R. U. David M. Lane, Introduction to Statistics. David Lane, 2003. ical features in tabular deep learning,” arXiv preprint arXiv:2203.05556,
[59] M. Ryan, Deep learning with structured data. Simon and Schuster, 2022.
2020. [85] E. Fitkov-Norris, S. Vahid, and C. Hand, “Evaluating the impact of
[60] M. W. Cvitkovic et al., “Deep learning in unconventional domains,” categorical data encoding and scaling on neural network classification
Ph.D. dissertation, California Institute of Technology, 2020. performance: the case of repeat consumption of identical cultural goods,”
[61] D. Dua and C. Graff, “UCI machine learning repository,” 2017. in International Conference on Engineering Applications of Neural
[Online]. Available: http://archive.ics.uci.edu/ml Networks. Springer, 2012, pp. 343–352.
[62] A. J. Miles, “The sunstroke epidemic of cincinnati, ohio, during the [86] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque,
summer of 1881,” Public health papers and reports, vol. 7, p. 293, S. Haykal, M. Ispir, V. Jain, L. Koc et al., “TFX: A tensorflow-based
1881. production-scale machine learning platform,” in Proceedings of the 23rd
[63] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” ACM SIGKDD International Conference on Knowledge Discovery and
Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936. Data Mining, 2017, pp. 1387–1395.
[64] D. A. Jdanov, D. Jasilionis, V. M. Shkolnikov, and M. Barbieri, “Human [87] B. Sun, L. Yang, W. Zhang, M. Lin, P. Dong, C. Young, and J. Dong,
mortality database,” Encyclopedia of gerontology and population ag- “Supertml: Two-dimensional word embedding for the precognition on
ing/editors Danan Gu, Matthew E. Dupre. Cham: Springer International structured tabular data,” in Proceedings of the IEEE/CVF Conference
Publishing, 2020, 2019. on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[65] E. Fix, Discriminatory analysis: nonparametric discrimination, consistency properties. USAF School of Aviation Medicine, 1951.
[66] C. L. Giles, C. B. Miller, D. Chen, H.-H. Chen, G.-Z. Sun, and Y.-C. Lee, “Learning and extracting finite state automata with second-order recurrent neural networks,” Neural Computation, vol. 4, no. 3, pp. 393–405, 1992.
[67] B. G. Horne and C. L. Giles, “An experimental comparison of recurrent neural networks,” Advances in Neural Information Processing Systems, pp. 697–704, 1995.
[68] L. Willenborg and T. De Waal, Statistical disclosure control in practice. Springer Science & Business Media, 1996, vol. 111.
[69] M. Richardson, E. Dominowska, and R. Ragno, “Predicting clicks: estimating the click-through rate for new ads,” in Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 521–530.
[70] G. Ke, Z. Xu, J. Zhang, J. Bian, and T.-Y. Liu, “Deepgbm: A deep learning framework distilled by gbdt for online prediction tasks,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 384–394.
[71] Z. Wang, Q. She, and J. Zhang, “Masknet: Introducing feature-wise multiplication to ctr ranking models by instance-guided mask,” arXiv:2102.07619, 2021.
[72] I. Shavitt and E. Segal, “Regularization learning networks: deep learning for tabular datasets,” in Advances in Neural Information Processing Systems, 2018, pp. 1379–1389.
[73] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” arXiv:2005.14165, 2020.
[74] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
[75] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” arXiv preprint arXiv:2101.01169, 2021.
[76] A. F. Karr, A. P. Sanil, and D. L. Banks, “Data quality: A statistical perspective,” Statistical Methodology, vol. 3, no. 2, pp. 137–173, 2006.
[77] L. Xu and K. Veeramachaneni, “Synthesizing Tabular Data using Generative Adversarial Networks,” arXiv preprint arXiv:1811.11264, 2018.
[78] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, 2017, pp. 3146–3154.
[79] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “Catboost: unbiased boosting with categorical features,” in Advances in Neural Information Processing Systems, 2018, pp. 6638–6648.
[80] Y. Zhu, T. Brettin, F. Xia, A. Partin, M. Shukla, H. Yoo, Y. A. Evrard, J. H. Doroshow, and R. L. Stevens, “Converting tabular data into images for deep learning with convolutional neural networks,” Scientific Reports, vol. 11, no. 1, pp. 1–11, 2021.
[88] J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar, “Vime: Extending the success of self- and semi-supervised learning to tabular domain,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[89] D. Bahri, H. Jiang, Y. Tay, and D. Metzler, “SCARF: Self-supervised contrastive learning using random feature corruption,” arXiv preprint arXiv:2106.15147, 2021.
[90] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir et al., “Wide & deep learning for recommender systems,” in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 7–10.
[91] N. Frosst and G. Hinton, “Distilling a neural network into a soft decision tree,” arXiv preprint arXiv:1711.09784, 2017.
[92] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun, “xdeepfm: Combining explicit and implicit feature interactions for recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1754–1763.
[93] G. Ke, J. Zhang, Z. Xu, J. Bian, and T.-Y. Liu, “TabNN: A universal neural network solution for tabular data,” 2018.
[94] Y. Luo, H. Zhou, W.-W. Tu, Y. Chen, W. Dai, and Q. Yang, “Network on network for tabular data classification in real-world applications,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2317–2326.
[95] Z. Liu, Q. Liu, H. Zhang, and Y. Chen, “Dnn2lr: Interpretation-inspired feature crossing for real-world tabular data,” arXiv preprint arXiv:2008.09775, 2020.
[96] S. Ivanov and L. Prokhorenkova, “Boost then Convolve: Gradient Boosting Meets Graph Neural Networks,” in International Conference on Learning Representations, 2021.
[97] H. Luo, F. Cheng, H. Yu, and Y. Yi, “SDTR: Soft Decision Tree Regressor for Tabular Data,” IEEE Access, vol. 9, pp. 55999–56011, 2021.
[98] X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, “TabTransformer: Tabular Data Modeling Using Contextual Embeddings,” arXiv:2012.06678, 2020.
[99] S. Cai, K. Zheng, G. Chen, H. Jagadish, B. C. Ooi, and M. Zhang, “Arm-net: Adaptive relation modeling network for structured data,” in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 207–220.
[100] J. Kossen, N. Band, C. Lyle, A. Gomez, T. Rainforth, and Y. Gal, “Self-attention between datapoints: Going beyond individual input-output pairs in deep learning,” in Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021.
[101] D. Micci-Barreca, “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems,” SIGKDD Explor., vol. 3, pp. 27–32, 2001.
[102] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[103] P. Langley and S. Sage, “Oblivious decision trees and abstract cases,” in Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, Seattle, WA, 1994, pp. 113–117.
[104] B. Peters, V. Niculae, and A. F. Martins, “Sparse sequence-to-sequence models,” arXiv:1905.05702, 2019.
[105] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[106] C. Guo and F. Berkhahn, “Entity embeddings of categorical variables,” arXiv preprint arXiv:1604.06737, 2016.
[107] S. Rendle, “Factorization machines,” in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995–1000.
[108] F. Moosmann, B. Triggs, and F. Jurie, “Fast discriminative visual codebooks using randomized clustering forests,” in Twentieth Annual Conference on Neural Information Processing Systems (NIPS’06). MIT Press, 2006, pp. 985–992.
[109] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers et al., “Practical lessons from predicting clicks on ads at facebook,” in Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, 2014, pp. 1–9.
[110] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[111] C. Wang, M. Li, and A. J. Smola, “Language models with transformers,” arXiv preprint arXiv:1904.09408, 2019.
[112] A. F. T. Martins and R. F. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” arXiv:1602.02068, 2016.
[113] G. Van Rossum and F. L. Drake Jr, Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995.
[114] M. Joseph, “Pytorch tabular: A framework for deep learning with tabular data,” arXiv preprint arXiv:2104.13638, 2021.
[115] S. Boughorbel, F. Jarray, and A. Kadri, “Fairness in tabnet model by disentangled representation for the prediction of hospital no-show,” arXiv preprint arXiv:2103.04048, 2021.
[116] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” ACM Computing Surveys (CSUR), vol. 54, no. 6, pp. 1–35, 2021.
[117] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
[118] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[119] V. Borisov, J. Haug, and G. Kasneci, “CancelOut: A layer for feature selection in deep neural networks,” in International Conference on Artificial Neural Networks. Springer, 2019, pp. 72–83.
[120] G. Valdes, W. Arbelo, Y. Interian, and J. H. Friedman, “Lockout: Sparse regularization of neural networks,” arXiv preprint arXiv:2107.07160, 2021.
[121] J. Fiedler, “Simple modifications to improve tabular neural networks,” arXiv preprint arXiv:2108.03214, 2021.
[122] K. Lounici, K. Meziani, and B. Riu, “Muddling label regularization: Deep learning for tabular datasets,” arXiv preprint arXiv:2106.04462, 2021.
[123] S. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” NeurIPS, 2017.
[124] H. Chen, S. Jajodia, J. Liu, N. Park, V. Sokolov, and V. Subrahmanian, “Faketables: Using gans to generate functional dependency preserving tables with bounded real data,” in IJCAI, 2019, pp. 2074–2080.
[125] M. Quintana and C. Miller, “Towards class-balancing human comfort datasets with gans,” in Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, 2019, pp. 391–392.
[126] A. Koivu, M. Sairanen, A. Airola, and T. Pahikkala, “Synthetic minority oversampling of vital statistics data with generative adversarial networks,” Journal of the American Medical Informatics Association, vol. 27, no. 11, pp. 1667–1674, 2020.
[127] J. Fan, J. Chen, T. Liu, Y. Shen, G. Li, and X. Du, “Relational data synthesis using generative adversarial networks: A design space exploration,” Proc. VLDB Endow., vol. 13, no. 12, pp. 1962–1975, Jul. 2020.
[128] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8110–8119.
[129] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun, “Adversarial ranking for language generation,” Advances in Neural Information Processing Systems, 2017.
[130] S. Subramanian, S. Rajeswar, F. Dutil, C. Pal, and A. Courville, “Adversarial generation of natural language,” in Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017, pp. 241–251.
[131] N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,” in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2016, pp. 399–410.
[132] Z. Li, Y. Zhao, and J. Fu, “Sync: A copula based framework for generating synthetic data from aggregated sources,” in 2020 International Conference on Data Mining Workshops (ICDMW). IEEE, 2020, pp. 571–578.
[133] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao, “Privbayes: Private data release via bayesian networks,” ACM Transactions on Database Systems (TODS), vol. 42, no. 4, pp. 1–41, 2017.
[134] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[135] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014.
[136] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in 4th International Conference on Learning Representations, ICLR 2016, 2016.
[137] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 214–223.
[138] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of wasserstein gans,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 5769–5779.
[139] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos, “The cramer distance as a solution to biased wasserstein gradients,” 2017.
[140] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio, “Boundary-seeking generative adversarial networks,” International Conference on Learning Representations, 2018.
[141] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, “Veegan: Reducing mode collapse in gans using implicit variational learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 3310–3320.
[142] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Conference Track Proceedings, pp. 1–14, 2014.
[143] C. Ma, S. Tschiatschek, R. Turner, J. M. Hernández-Lobato, and C. Zhang, “Vaem: a deep generative model for heterogeneous mixed type data,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[144] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, “Modeling tabular data using conditional GAN,” in Advances in Neural Information Processing Systems, vol. 33, 2019.
[145] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim, “Data synthesis based on generative adversarial networks,” Proceedings of the VLDB Endowment, vol. 11, no. 10, pp. 1071–1083, 2018.
[146] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” in International Conference on Learning Representations, 2017.
[147] B. Wen, L. O. Colon, K. Subbalakshmi, and R. Chandramouli, “Causal-TGAN: Generating tabular data using causal generative adversarial networks,” arXiv preprint arXiv:2104.10680, 2021.
[148] J. Jordon, J. Yoon, and M. Van Der Schaar, “Pate-gan: Generating synthetic data with differential privacy guarantees,” in International Conference on Learning Representations, 2018.
[149] A. Mottini, A. Lheritier, and R. Acuna-Agost, “Airline Passenger Name Record Generation using Generative Adversarial Networks,” arXiv preprint arXiv:1807.06657, 2018.
[150] R. Camino, C. Hammerschmidt, and R. State, “Generating multi-categorical samples with generative adversarial networks,” ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
[151] M. K. Baowaly, C. C. Lin, C. L. Liu, and K. T. Chen, “Synthesizing electronic health records using improved generative adversarial networks,” Journal of the American Medical Informatics Association, vol. 26, no. 3, pp. 228–241, 2019.
[152] L. V. H. Vardhan and S. Kok, “Generating privacy-preserving synthetic tabular data using oblivious variational autoencoders,” in Proceedings of the Workshop on Economics of Privacy and Data Labor at the 37th International Conference on Machine Learning, 2020.
[153] Z. Zhao, A. Kunar, H. Van der Scheer, R. Birke, and L. Y. Chen, “Ctab-gan: Effective table data synthesizing,” arXiv preprint arXiv:2102.08369, 2021.
[154] F. J. Massey Jr, “The kolmogorov-smirnov test for goodness of fit,” Journal of the American Statistical Association, vol. 46, no. 253, pp. 68–78, 1951.
[155] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, “A survey of methods for explaining black box models,” ACM Comput. Surv., vol. 51, no. 5, 2018.
[156] K. Gade, S. C. Geyik, K. Kenthapadi, V. Mithal, and A. Taly, “Explainable ai in industry,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2019.
[157] M. Pawelczyk, S. Bielawski, J. Van den Heuvel, T. Richter, and G. Kasneci, “Carla: A python library to benchmark algorithmic recourse and counterfactual explanation algorithms,” in Advances in Neural Information Processing Systems (NeurIPS) Benchmark and Datasets Track, 2021.
[158] Y. Lou, R. Caruana, and J. Gehrke, “Intelligible models for classification and regression,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2012.
[159] D. Alvarez-Melis and T. S. Jaakkola, “Towards robust interpretability with self-explaining neural networks,” NeurIPS, 2018.
[160] D. Wang, Q. Yang, A. Abdul, and B. Y. Lim, “Designing theory-driven user-centric explainable ai,” in CHI, 2019.
[161] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLoS ONE, vol. 10, no. 7, p. e0130140, 2015.
[162] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, “Layer-wise relevance propagation: an overview,” Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 193–209, 2019.
[163] G. Kasneci and T. Gottron, “Licon: A linear weighting scheme for the contribution of input variables in deep artificial neural networks,” in CIKM, 2016.
[164] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 3319–3328.
[165] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.
[166] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’ Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[167] M. T. Ribeiro, S. Singh, and C. Guestrin, “Anchors: High-precision model-agnostic explanations,” in AAAI, 2018.
[168] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee, “From local explanations to global understanding with explainable ai for trees,” Nature Machine Intelligence, 2020.
[169] J. Haug, S. Zürn, P. El-Jiz, and G. Kasneci, “On baselines for local feature attributions,” arXiv:2101.00905, 2021.
[170] Y. Liu, S. Khandagale, C. White, and W. Neiswanger, “Synthetic benchmarks for scientific research in explainable machine learning,” in Advances in Neural Information Processing Systems (NeurIPS) Benchmark and Datasets Track, 2021.
[171] S. Wachter, B. Mittelstadt, and C. Russell, “Counterfactual explanations without opening the black box: automated decisions and the gdpr,” Harvard Journal of Law & Technology, vol. 31, no. 2, 2018.
[172] B. Ustun, A. Spangher, and Y. Liu, “Actionable recourse in linear classification,” in FAT*, 2019.
[173] C. Russell, “Efficient search for diverse coherent explanations,” in FAT*, 2019.
[174] K. Rawal and H. Lakkaraju, “Beyond individualized recourse: Interpretable and interactive summaries of actionable recourses,” in NeurIPS, 2020.
[175] A.-H. Karimi, G. Barthe, B. Balle, and I. Valera, “Model-agnostic counterfactual explanations for consequential decisions,” in AISTATS, 2020.
[176] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das, “Explanations based on the missing: Towards contrastive explanations with pertinent negatives,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
[177] B. Mittelstadt, C. Russell, and S. Wachter, “Explaining explanations in ai,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019.
[178] R. K. Mothilal, A. Sharma, and C. Tan, “Explaining machine learning classifiers through diverse counterfactual explanations,” in FAT*, 2020.
[179] M. Pawelczyk, K. Broelemann, and G. Kasneci, “Learning model-agnostic counterfactual explanations for tabular data,” in The Web Conference 2020 (WWW). ACM, 2020.
[180] M. Downs, J. L. Chu, Y. Yacoby, F. Doshi-Velez, and W. Pan, “Cruds: Counterfactual recourse using disentangled subspaces,” ICML Workshop on Human Interpretability in Machine Learning (WHI 2020), 2020.
[181] S. Joshi, O. Koyejo, W. Vijitbenjaronk, B. Kim, and J. Ghosh, “Towards realistic individual recourse and actionable explanations in black-box decision making systems,” arXiv preprint arXiv:1907.09615, 2019.
[182] D. Mahajan, C. Tan, and A. Sharma, “Preserving causal constraints in counterfactual explanations for machine learning classifiers,” arXiv preprint arXiv:1912.03277, 2019.
[183] M. Pawelczyk, K. Broelemann, and G. Kasneci, “On counterfactual explanations under predictive multiplicity,” in Conference on Uncertainty in Artificial Intelligence (UAI). PMLR, 2020.
[184] J. Antorán, U. Bhatt, T. Adel, A. Weller, and J. M. Hernández-Lobato, “Getting a clue: A method for explaining uncertainty estimates,” ICLR, 2021.
[185] A. Nazabal, P. M. Olmos, Z. Ghahramani, and I. Valera, “Handling incomplete heterogeneous data using vaes,” Pattern Recognition, 2020.
[186] S. Upadhyay, S. Joshi, and H. Lakkaraju, “Towards robust and reliable algorithmic recourse,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021.
[187] R. Dominguez-Olmedo, A.-H. Karimi, and B. Schölkopf, “On the adversarial robustness of causal algorithmic recourse,” in International Conference on Machine Learning (ICML), 2022.
[188] M. Pawelczyk, T. Datta, J. van-den Heuvel, G. Kasneci, and H. Lakkaraju, “Algorithmic recourse in the face of noisy human responses,” arXiv:2203.06768, 2022.
[189] A.-H. Karimi, G. Barthe, B. Schölkopf, and I. Valera, “A survey of algorithmic recourse: definitions, formulations, solutions, and prospects,” arXiv preprint arXiv:2010.04050, 2020.
[190] S. Verma, J. Dickerson, and K. Hines, “Counterfactual explanations for machine learning: A review,” 2020.
[191] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
[192] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[193] FICO, “Home equity line of credit (HELOC) dataset,” https://community.fico.com/s/explainable-machine-learning-challenge, 2019 (accessed June 15, 2022).
[194] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles in high-energy physics with deep learning,” Nature Communications, vol. 5, no. 1, pp. 1–9, 2014.
[195] C. Z. Mooney, Monte Carlo simulation. Sage, 1997, no. 116.
[196] R. K. Pace and R. Barry, “Sparse spatial autoregressions,” Statistics & Probability Letters, vol. 33, no. 3, pp. 291–297, 1997.
[197] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and regression trees. Routledge, 2017.
[198] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[199] K. Broelemann and G. Kasneci, “A gradient-based split criterion for highly accurate and transparent model trees,” in IJCAI, 2019.
[200] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[201] Y. Yamada, O. Lindenbaum, S. Negahban, and Y. Kluger, “Feature selection using stochastic gates,” in Proceedings of Machine Learning and Systems 2020, 2020, pp. 8952–8963.
[202] R. Agarwal, N. Frosst, X. Zhang, R. Caruana, and G. E. Hinton, “Neural additive models: Interpretable machine learning with neural nets,” arXiv preprint arXiv:2004.13912, 2020.
[203] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
[204] D. Merkel, “Docker: lightweight linux containers for consistent development and deployment,” Linux Journal, vol. 2014, no. 239, p. 2, 2014.
[205] C. S. Bojer and J. P. Meldgaard, “Kaggle forecasting competitions: An overlooked learning opportunity,” International Journal of Forecasting, vol. 37, no. 2, pp. 587–603, 2021.
[206] Y. Rong, T. Leemann, V. Borisov, G. Kasneci, and E. Kasneci, “A consistent and efficient evaluation strategy for feature attribution methods,” in International Conference on Machine Learning. PMLR, 2022.
[207] R. Tomsett, D. Harborne, S. Chakraborty, P. Gurram, and A. Preece, “Sanity checks for saliency metrics,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6021–6029.
[208] E. Ntoutsi, P. Fafalios, U. Gadiraju, V. Iosifidis, W. Nejdl, M.-E. Vidal, S. Ruggieri, F. Turini, S. Papadopoulos, E. Krasanakis et al., “Bias in data-driven artificial intelligence systems: an introductory survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 10, no. 3, p. e1356, 2020.
[209] P. Domingos and G. Hulten, “Mining high-speed data streams,” in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 71–80.
[210] C. Manapragada, G. I. Webb, and M. Salehi, “Extremely fast decision tree,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1953–1962.
[211] R. Balestriero, L. Bottou, and Y. LeCun, “The effects of regularization and data augmentation are class dependent,” arXiv preprint arXiv:2204.03632, 2022.
[212] P. Duda, M. Jaworski, A. Cader, and L. Wang, “On training deep neural networks using a streaming approach,” Journal of Artificial Intelligence and Soft Computing Research, vol. 10, 2020.
[213] J. Haug, M. Pawelczyk, K. Broelemann, and G. Kasneci, “Leveraging model inherent variable importance for stable online feature selection,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1478–1502.
[214] J. Haug and G. Kasneci, “Learning parameter distributions to detect concept drift in data streams,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 9452–9459.
[215] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on deep transfer learning,” in International Conference on Artificial Neural Networks. Springer, 2018, pp. 270–279.
[216] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.
[217] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[218] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[219] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[220] T. Ucar, E. Hajiramezanali, and L. Edwards, “Subtab: Subsetting features of tabular data for self-supervised representation learning,” Advances in Neural Information Processing Systems, vol. 34, 2021.