Proceedings of Machine Learning Research LEAVE UNSET:1–28, 2022
Conference on Health, Inference, and Learning (CHIL) 2022
Data Augmentation for Electrocardiograms
Aniruddh Raghu
[email protected]
Massachusetts Institute of Technology, USA
Divya Shanmugam
[email protected]
Massachusetts Institute of Technology, USA
Eugene Pomerantsev
[email protected]
arXiv:2204.04360v1 [cs.LG] 9 Apr 2022
Massachusetts General Hospital, USA
John Guttag
[email protected]
Massachusetts Institute of Technology, USA
Collin M. Stultz
[email protected]
Massachusetts Institute of Technology, USA
Abstract
Neural network models have demonstrated impressive performance in predicting pathologies
and outcomes from the 12-lead electrocardiogram (ECG). However, these models often need
to be trained with large, labelled datasets,
which are not available for many predictive
tasks of interest. In this work, we perform
an empirical study examining whether training time data augmentation methods can be
used to improve performance on such datascarce ECG prediction problems. We investigate how data augmentation strategies impact
model performance when detecting cardiac abnormalities from the ECG. Motivated by our
finding that the effectiveness of existing augmentation strategies is highly task-dependent,
we introduce a new method, TaskAug, which
defines a flexible augmentation policy that is
optimized on a per-task basis. We outline an
efficient learning algorithm to do so that leverages recent work in nested optimization and implicit differentiation. In experiments, considering three datasets and eight predictive tasks, we
find that TaskAug is competitive with or improves on prior work, and the learned policies
shed light on what transformations are most effective for different tasks. We distill key insights
from our experimental evaluation, generating a
set of best practices for applying data augmentation to ECG prediction problems.
Data and Code Availability We use three
datasets: two are from Massachusetts General Hospital (MGH) and are not publicly available; the third
is PTB-XL (Wagner et al., 2020), which is publicly
available on the PhysioNet repository (Goldberger
et al., 2000). Code implementing our method is available here: https://github.com/aniruddhraghu/
ecg_aug.
1. Introduction
Electrocardiography is used widely in medicine as
a non-invasive and relatively inexpensive method of
measuring the electrical activity in an individual’s
heart. The output of electrocardiography — the electrocardiogram (ECG) — is of great utility to clinicians in diagnosing and monitoring various cardiovascular conditions (Salerno et al., 2003; Fesmire et al.,
1998; Blackburn et al., 1960).
In recent years, there has been significant interest in automatically predicting cardiac abnormalities, diseases, and outcomes directly from ECGs using
neural network models (Hannun et al., 2019; Raghunath et al., 2020; Gopal et al., 2021; Diamant et al.,
2021; Kiyasseh et al., 2021; Raghu et al., 2021a). Although these works demonstrate impressive results,
they often require large labelled datasets with paired
ECGs and labels to train models. In certain situations, it is challenging to construct such datasets. For
example, consider inferring abnormal central hemodynamics (e.g., cardiac output) from the ECG, which
is important when monitoring patients with heart
failure or pulmonary hypertension (Schlesinger et al.,
2021). Accurate hemodynamics labels are only obtainable through specialized invasive studies (Bajorat et al., 2006; Hiemstra et al., 2019) and hence it
© 2022 A. Raghu, D. Shanmugam, E. Pomerantsev, J. Guttag & C.M. Stultz.
Data Augmentation for Electrocardiograms
Task 1: RVH Detection
80
Task 2: AFib Detection
ferent predictive tasks, which cover different classes
of cardiac abnormalities.
• We analyze the results from our evaluation, finding that many augmentation strategies do not work
well across all tasks. Given its task-specific nature,
TaskAug is competitive with or improves on other
methods for the problems we examined.
• We study the learned TaskAug policies, finding
that they offer insights as to what augmentations
are most appropriate for different tasks.
• We provide a summary of findings and best practices to assist future studies exploring data augmentation for ECG tasks.
80
AUROC
AUROC
75
75
70
65
No Augmentations
70
With SpecAugment No Augmentations
With SpecAugment
Figure 1: The effect of data augmentation on ECG
prediction tasks is task-dependent. We
examine the mean/standard error of AUROC
over 5 runs when applying SpecAugment (Park
et al., 2019), a data augmentation method,
to two different ECG prediction tasks. We
observe performance improvement in one setting (left, Right Ventricular Hypertrophy),
and performance reduction in another (right,
Atrial Fibrillation).
2. Related Work
Data augmentation for time-series. Prior research on time-series data augmentation includes: (1)
large-scale surveys exploring the impact of augmentation on various downstream modalities (Iwana and
Uchida,
2021a,b; Wen et al., 2020); and (2) specific
is difficult to obtain large datasets with paired ECGs
methods
for particular modalities, including speech
and hemodynamics variables.
et al., 2019, 2020), wearable device sigsignals
(Park
Data augmentation (Hataya et al., 2020; Wen et al.,
et
al.,
2017), and time series forecasting
nals
(Um
2020; Shorten and Khoshgoftaar, 2019; Iwana and
(Bandara
et
al.,
2021;
Smyl and Kuber, 2016). There
Uchida, 2021a; Cubuk et al., 2019, 2020) during trainis
relatively
little
work
exploring how augmentation
ing is a useful strategy to improve the predictive percan
impact
performance
for ECG-based prediction
formance of models in data-scarce regimes. However,
tasks,
with
prior
studies
mostly
restricted to considerthere exists limited work studying data augmentation
et
al.,
2020; Banerjee and
ing
single
tasks
(Hatamian
for ECGs. A key problem with applying standard
2021).
In
contrast,
in
this
paper, we evaluate
Ghose,
data augmentations is that fine-grained information
a
set
of
data
augmentation
methods
on many different
within ECGs, such as relative amplitudes of portions
predictive
tasks,
studying
when
and
why augmentaof beats, carry predictive signal: augmentations may
tions
may
help.
In
addition,
the
data
augmentation
worsen performance if such predictive signal is destrategy
proposed
in
this
work,
TaskAug,
can be readstroyed. Furthermore, the effectiveness of data augily
adapted
to
new
predictive
tasks,
unlike
in existing
mentations with ECGs varies on a task-specific basis
works
where
the
methods
may
be
designed
for a very
– applying the same augmentation for two different
specific
downstream
task.
tasks could help performance in one case, and hurt
There also exists related work on using data augperformance in another (Figure 1).
mentation for contrastive pre-training with ECGs
In this work, we take steps towards addressing
(Gopal et al., 2021; Kiyasseh et al., 2021; Raghu
these issues. Our contributions are as follows:
et al., 2021b; Mehari and Strodthoff, 2021). These
• We propose TaskAug, a new task-dependent aug- works are complementary to ours; we focus specifimentation strategy. TaskAug defines a flexible aug- cally on supervised learning (rather than contrastive
mentation policy that is optimized on a per-task pre-training), and we hypothesize that our proposed
basis. We outline an efficient learning algorithm augmentation pipeline could be used in these prior
to do so that leverages recent work in nested op- methods for improved contrastive learning.
timization and implicit differentiation. (Lorraine
et al., 2020).
Designing and learning data augmentation
• We conduct an empirical study of TaskAug and policies. The structure of TaskAug, our proposed
other augmentation strategies on ECG predictive augmentation strategy, was inspired by related work
problems. We consider three datasets and eight dif- on flexible data augmentation policies in computer vi2
Data Augmentation for Electrocardiograms
4. Data Augmentation Methods
sion (Cubuk et al., 2019, 2020; Hataya et al., 2020).
We extend these ideas to ECG predictive tasks by
(1) selecting appropriate transformations for ECG
data, and (2) allowing for class-specific transformation strengths. Since such policies introduce many
hyperparameters, we use a bi-level optimization algorithm to enable scalable policy learning (Lorraine
et al., 2020; Raghu et al., 2021c).
We now describe the data augmentation methods
considered in our experiments. We also present our
new, learnable data augmentation method that can
be used to find task-specific augmentation policies,
and an algorithm to optimize its parameters.
4.1. Existing Data Augmentation Methods
We evaluate the following set of existing data augmentation strategies, which includes operations in the
signal (time-domain) space, frequency space, and interpolated signal space, providing good coverage of
the possible space of augmentations.
3. Problem Setup and Notation
We focus on supervised binary classification problems
from ECG data. Let x ∈ R12×T refer to a 12-lead
ECG of T samples and y ∈ {0, 1} refer to a binary
target. We let D = {(xn , yn )}N
n=1 refer to a dataset
of N ECG-label pairs.
Let f (x; θ) → ŷ be a neural network model with
parameters θ that outputs a predicted label ŷ given
x as input. Network parameters are optimized to
minimize the average binary cross entropy loss LBCE
on the training dataset D(train) .
We restrict our study to single label binary classification problems in this work in order to study the
effect of data augmentation on a per-task basis. One
can extend this to multilabel binary classification by
letting y be a vector of several different binary labels and training the network to produce a vector of
predictions.
Time Masking. This is a commonly used method
in time-series and ECG data augmentation work
(Iwana and Uchida, 2021a; Gopal et al., 2021). We
mask out (set to zero) a contiguous fraction w ∈ [0, 1]
of the original signal of length T , We choose a random
starting sample ts and set all samples [ts , ts +wT ] = 0.
SpecAugment. A highly popular method for augmenting speech signals (Park et al., 2019, 2020). We
follow the approach from Kiyasseh et al. (2021), and
apply masking (setting components to zero) in the
time and frequency domains as follows. We take
the Short-Time Fourier Transform (STFT) of the input signal, and independently mask a fraction w of
the temporal bins and frequency bins (this involves
setting the complex valued entries in these bins to
0 + 0j). The inverse STFT is then used to map the
signal back to the time domain.
Training with Data Augmentation. Let
A(x, y; φ) → x̃ refer to a data augmentation function
with hyperparameters φ that takes the input ECG x
and its label y and outputs an augmented version x̃.
Note that this formulation implicitly assumes that
the augmentation is label preserving, since it does
not also change the label y. Where relevant, the
augmentation hyperparameters φ may control the
strength/probability of applying an augmentation.
The process of training with data augmentation 1
amounts to:
1. Sample a data point and label pair from the training set: (x, y) ∼ D(train) .
2. Apply the augmentation A : x 7→ x̃, to transform
the original input x to an augmented version x̃.
3. Use the pair (x̃, y) in training.
Discriminative Guided Warping (DGW). Introduced in Iwana and Uchida (2021b), this method
uses Dynamic Time Warping (DTW) (Müller, 2007;
Berndt and Clifford, 1994) to warp a source ECG to
match a representative reference signal that is dissimilar to examples from other classes.
SMOTE (Chawla et al., 2002). A commonly
used oversampling strategy, the SMOTE algorithm
generates new synthetic examples of the minority
class by interpolating minority class samples. Given
that many ECG prediction problems are characterized by significant class imbalance, oversampling algorithms are important methods to consider. In contrast to the other methods, the SMOTE algorithm
generates an augmented dataset prior to any training,
1. For the SMOTE baseline this process is slightly different; based on a predefined training set size, rather than
details are in Section 4.
augmenting examples at each training iteration (as
3
Data Augmentation for Electrocardiograms
ui
allows differenThe multiplicative factor stop grad(u
i)
tiation w.r.t the operation selection parameters π.
This enables gradient-based optimization of π (see
Section 4.2.2). The denominator is necessary because
the reparameterized sample from the categorical distribution is not one-hot. Further details are in Appendix A.
Suppose a particular augmentation function Ai
with strength parameters µ0 and µ1 is obtained following Eqns 1 and 2. Then, denoting the input to this
augmentation stage as x with label y, the function Ai
that computes the augmented output is defined as:
presented in Section 4). We set this value to achieve
a balanced number of the two classes.
4.2. TaskAug : A New Augmentation Policy
Motivation. The approaches mentioned so far are
simple to implement and can be effective for various
problems; however, they are fairly inflexible, given
each individually uses only one or two fixed transformations. With ECGs, recall that it is unclear on
a per-task basis which augmentations may help or
worsen performance (Figure 1). Designing a more
flexible augmentation strategy that is optimized on a
per-task basis could help with this problem, and we
now describe such an approach – TaskAug.
Ai (x, y; µ0 , µ1 ) = ti (x; s),
(4)
where ti is the actual transformation applied
to the signal (e.g., time masking), and s is
the
transformation strength, computed as follows:
High-level structure. We define a set of operas
=
yµ
1 + (1 − y)µ0 . See Appendix A for a detailed
tions S = {A1 , . . . , AM }, each of which is an augmenexample
of the different steps in applying TaskAug.
tation function of the form Ai (x, y; µ0 , µ1 ), where x
is the input data point to the augmentation function, Extension to multiclass and multilabel sety is the label, and {µ0 , µ1 } represent the augmenta- tings. Our instantiation of TaskAug is for the bition strengths for datapoints of class label 0 and class nary classification setting, since this is the scenario
label 1 respectively. We separately parameterize the we consider in our experiments. The formulation
augmentation strengths for each class because trans- can be extended to multiclass/multilabel problems
formations may corrupt predictive information in the by defining an operation selection probability masignal for one class but not the other.
trix and strength matrix at each augmentation stage.
The overall augmentation policy consists of a set of The operation selection probabilities and operation
K stages, where at each stage we: (1) sample an aug- strengths for a given example are then obtained by
mentation function Ai to apply; and (2) apply it to taking the matrix product of the relevant parameter
the input signal to that stage. This allows composing matrix and the label vector y.
combinations of operations in a stochastic manner. A
high-level schematic is shown in Figure 2.
4.2.2. Optimizing Policy Parameters
Mathematical definition. The policy is defined
following Hataya et al. (2020). At each augmen- Although the defined policy is flexible, it introduces
tation stage k ∈ {1, . . . , K} we have a set of op- many new parameters – for a binary problem, there
eration selection parameters π (k) ∈ [0, 1]M , where are M operation selection parameters for the cateP (k)
gorical distributions at each stage, and 2 strength pa= 1 ∀k. Each vector π (k) parameterizes
i πi
rameters at each stage, resulting in K × (2 + M ) total
(k)
a categorical distribution such that each entry πi
parameters. Finding effective values for these paramrepresents the probability of selecting operation i at eters with random/grid search or Bayesian optimizaaugmentation stage k. We obtain a reparameteriz- tion is computationally expensive since they require
able sample from this categorical distribution (using training models many times with different parameter
the Gumbel-Softmax trick, (Jang et al., 2016; Maddi- settings. We therefore use a gradient-based learning
son et al., 2016)) at each stage to select the operation scheme to learn these parameters online.
to use, as follows:
We optimize policy parameters to minimize a
model’s
validation loss, which is computed using non(k)
M
u ∼ Categorical(π ) # Note that u ∈ R
(1)
augmented data. Following prior work (Lorraine
i = arg max u
(2) et al., 2020; Hataya et al., 2020; Raghu et al., 2021c),
ui
Ai (x, y; µ0 , µ1 ).
(3) we alternate gradient updates on the network paramx̃ =
stop grad(ui )
eters θ and the augmentation parameters φ by iter4.2.1. Formalizing TaskAug
4
Data Augmentation for Electrocardiograms
Figure 2: Structure of TaskAug. Augmentations to apply are sampled from a set of available operations, and
applied in sequence. Here we show an example with K = 2 stages of augmentation. We omit details
relating to the per-class magnitudes and probabilities of sampling for clarity.
5. Experiments
ating the following steps (details and full algorithm
in Appendix A):
• Optimize the model parameters θ for P steps:
at each step, sample a batch (x, y) of data from
D(train) , augment the batch with the augmentation
policy to obtain (x̃, y), compute the predicted label
ŷ, and update the model parameters using gradient
descent: θ ← θ − η∇L(y, ŷ).
• Compute the validation loss LV using an unaugmented batch from the validation dataset.
• Perform a gradient update on the augmentation
parameters φ. We use the chain rule to re-express
the gradient wrt the augmentation parameters:
We evaluate the data augmentation strategies on
ECG prediction tasks. We have two main experimental questions: (1) in what settings can data augmentation be beneficial, and (2) when data augmentation
does help, which augmentation strategies are most effective? To investigate these questions, we consider
a range of settings that cover three different 12-lead
ECG datasets and eight prediction tasks of varying
difficulty, class imbalance, and training set sizes.
5.1. Experimental Setup
5.1.1. Datasets and Tasks
∂LV
∂LV
∂θ
=
×
,
∂φ
∂θ
∂φ
We highlight key information about our datasets and
tasks here, with a summary in Table 1.
and compute this as follows. The first term on the
RHS is found exactly using straightforward backpropagation; the second term is approximated using the algorithm from Lorraine et al. (2020), leveraging implicit differentiation for efficient computation (since differentiating through training exactly
is too memory-intensive). The augmentation paV
rameters are then updated: φ ← φ − η ∂L
∂φ .
By using this algorithm, augmentation parameters
are learned on a per-task basis and analyzing the
learned parameters may allow us to understand what
augmentations are useful for different problems. We
return to this in Section 5.2.2.
Dataset A is from Massachusetts General Hospital
(MGH) and contains paired 12-lead ECGs and labels
for different cardiac abnormalities. Of the available
labels in the dataset, we select Right Ventricular Hypertrophy (RVH) and Atrial Fibrillation (AFib) as
two of the predictive tasks in our evaluation. These
were chosen because (1) they have been previously
studied as prediction targets from the ECG (Couceiro
et al., 2008; Lin and Lu, 2020), and (2) they have low
positive prevalence: 1% for RVH, and 5% for AFib,
and therefore help to understand the impact of data
augmentation in imbalanced prediction problems.
Computational cost. Optimizing policy parameters in this manner is significantly more computationally efficient than running a grid search over parameter values. With P = 1, running this algorithm
has about 2 − 3× the computational cost of training
without any augmentations.
Dataset B is PTB-XL (Wagner et al., 2020; Goldberger et al., 2000), an open-source dataset of 12-lead
ECGs. Each ECG has labels for four different categories of cardiac abnormality. This dataset has been
used in prior work to evaluate ECG predictive models
(Gopal et al., 2021; Kiyasseh et al., 2021).
5
Data Augmentation for Electrocardiograms
Dataset
Task name
Prevalence
Abnormality type
#ecgs/#patients
Dataset A
Right Ventricular Hypertrophy (RVH)
Atrial Fibrillation (AFib)
1%
5%
Structural
Electrical
705057/705057
705057/705057
Dataset B (PTB-XL)
Hypertrophy (HYP)
ST/T Change (STTC)
Conduction Disturbance (CD)
Myocardial Infarction (MI)
12%
22%
24%
25%
Structural
Ischemia
Electrical
Ischemia
21837/18885
21837/18885
21837/18885
21837/18885
Low Cardiac Ouput (CO)
High Pulmonary Capillary Wedge
Pressure (PCWP)
4%
Hemodynamics
6290/4051
26%
Hemodynamics
6290/4051
Dataset C
Table 1: Summary information about the datasets and tasks considered in our empirical evaluation.
Dataset C is from the same hospital (MGH) as
Dataset A and contains paired ECGs and labels
for two hemodynamics parameters, Cardiac Output (CO) and Pulmonary Capillary Wedge Pressure
(PCWP). These measures of cardiac health are important in deciding treatment strategies for patients
with cardiac disease (Yancy et al., 2013; Hurst et al.,
1990; Solin et al., 1999). Typically, these parameters
can only be measured accurately through an invasive cardiac catheterization procedure (Bajorat et al.,
2006; Hiemstra et al., 2019). As a result, datasets
with paired ECGs and hemodynamics measurements
are relatively small. Considering the use of data augmentations to improve model performance in this limited data regime is therefore clinically relevant. We
specifically consider inferring abnormally low Cardiac
Output, and abnormally high Pulmonary Capillary
Wedge Pressure.
Note that the tasks considered cover different
classes of cardiac abnormalities: ischemia (MI,
STTC), structural (HYP, RVH), electrical (CD,
AFib), and abnormal hemodynamics (low CO, high
PCWP).
both sets). We split the development set into an 8020 training-validation split.
5.1.2. TaskAug Transformations
Based on prior work in time series and ECG data
augmentation (Iwana and Uchida, 2021a; Mehari and
Strodthoff, 2021) we use the following transformations in the TaskAug policy. Mathematical descriptions are in Appendix A.
• Random temporal warp: The signal is warped
with a random, diffeomorphic temporal transformation. This is formed by sampling from a zero
mean, fixed variance Gaussian at each temporal
location in the signal to obtain a velocity field, and
then integrating and smoothing (following Balakrishnan et al. (2018, 2019)) to generate a temporal
displacement field, which is applied to the signal.
The variance is the strength parameter, with higher
variance indicating more warping.
• Baseline wander: A low-frequency sinusoidal
component is added to the signal, with the amplitude of the sinusoid representing the strength.
• Gaussian noise: IID Gaussian noise is added to
Dataset splitting. Since the value of data augthe signal, with the strength parameter representmentation can depend on the amount of training
ing the variance of the Gaussian.
data, we train on different dataset sizes. For the non- • Magnitude scale: The signal amplitude is scaled
hemodynamic tasks (Datasets A and B), we generby a number drawn from a scaled uniform distribuate development datasets with 1000, 2500, and 5000
tion, with the scale being the strength parameter.
ECGs. On the more challenging hemodynamics in- • Time mask: A random contiguous section of the
ference tasks (Dataset C), for elevated PCWP, we
signal is masked out (set to zero).
consider two settings: using a development set of size • Random temporal displacement: The entire
1000, and using the full dataset. For low CO, we
signal is translated forwards or backwards in time
only use the full dataset, since reducing the dataset
by a random temporal offset, drawn from a uniform
size led to poor quality models.
distribution scaled by a strength parameter.
In each setting, we split datasets into development Note that our instantiation of the augmentation poland testing sets on a patient-level (no patient is in icy could utilize many more operations, but we keep
6
Data Augmentation for Electrocardiograms
it to this number for simplicity and to assist in interpreting the learned policies.
regimes for both datasets (N = 1000), we focus on
this setting with results shown in Table 2. Results for
the higher sample regimes are in the appendix. We
summarize key findings here.
The value of augmentation varies by task. For
some tasks such as RVH and MI, almost all augmentation strategies lead to performance improvements.
On other tasks such as STTC and HYP, performance
is the same or worse when applying augmentations.
The improvement seen with RVH could be due to the
fact that it is particularly low prevalence (1%), so all
augmentation strategies have an oversampling effect
and thus boost performance.
TaskAug performs well on average. TaskAug almost always improves on the NoAugs baseline, and
even boosts performance on some tasks where other
augmentations worsen performance (AFib).
Although TaskAug does not always result in a statistically significant (p < 0.05) improvement in AUROC ,
it is the only method to significantly improve AUPRC
over NoAugs on the low-prevalance tasks, RVH and
AFib (see Appendix C, Table 4).
When TaskAug results in lower performance than
other augmentation strategies (e.g., TimeMasking for
CD), it is still competitive with these methods and
never causes a statistically significant reduction in
performance compared to other methods. This suggests that for a new task, it may always be worth
using TaskAug to see if performance is boosted. We
hypothesise that TaskAug’s efficacy is due to its flexible and learned nature, examined in ablation studies
(Section 5.2.3).
Performance improvements are smaller on Dataset
B. The maximum improvement over the NoAugs
baseline in Dataset A (5.8%) is greater than the maximum improvement in Dataset B (2.3%). We hypothesise two reasons for this. Firstly, the prevalence
in Dataset B is higher, meaning that augmentations
may not have as much of an effect at N = 1000. We
study this in Appendix C, Table 11, where we examine performance at the N = 500 data regime for
Dataset B, and find that the maximum improvement
(obtained with TaskAug for MI) goes up to 4%.
Secondly, Dataset A has narrower label definitions
than Dataset B, and this affects performance, especially with TaskAug. The HYP, STTC, and CD
classes of abnormalities in Dataset B aggregrate many
sub-categories together (see Appendix B), and these
sub-categories may each benefit from different augmentations. In contrast, the labels in Dataset A are
fine-grained, and so TaskAug, which optimizes aug-
5.1.3. Implementation Details
Network architecture. We standardize the network architecture to be a 1D convolutional network,
based on the ResNet-18 architecture, since prior work
has shown architectures of this form to be effective
with ECG data (Diamant et al., 2021). Full architectural details are in the appendix.
Training Details. On Datasets A and B, all models are trained for 100 epochs, using early stopping
based on validation loss. For the hemodynamics inference problems on Dataset C, we train models for
50 epochs with early stopping (since we observed significant overfitting after this point). We consider 15
random development/testing set splits for Datasets A
and C (lower prevalences for some tasks meant that
performance was more variable with fewer runs), and
5 splits for Dataset B. We train models using the
Adam optimizer and a learning rate of 1e-3. This
value resulted in stable and effective training across
all models (as compared to 1e-4, 5e-4, and 5e-3). As
evaluation, we compute the AUROC of the best performing model on the held-out testing set, and report
mean/standard error across runs. We also report results for a baseline (NoAugs) that does not use any
data augmentation.
Augmentation Hyperparameters. In TaskAug,
we set the number of augmentation stages to K =
2 (defined in Section 4.2.1), following prior work
(Hataya et al., 2020). For the number of model optimization steps P (defined in Section 4.2.2), we evaluate both P = 1 and P = 5, and select the best
performing setting based on validation set loss. Further discussion on the choice of P is in Appendix A.
For Time Masking and SpecAugment, we search
over the masking window, considering w ∈ {0.1, 0.2}
for SpecAugment (range based on Kiyasseh et al.
(2021)) and w ∈ {0.1, 0.2, 0.5} for Time Masking
(range based on Gopal et al. (2021)).
5.2. Results
5.2.1. Quantitative results
Non-hemodynamics tasks. We first analyze performance of augmentation strategies on the nonhemodynamics tasks. Given that performance improvements are most evident in the lowest sample
7
Data Augmentation for Electrocardiograms
Dataset A
RVH
AFib
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
72.6 ± 2.7
78.4 ± 1.9
75.9 ± 1.8
73.6 ± 1.7
77.9 ± 1.7
72.8 ± 2.1
Dataset B
HYP
STTC
MI
79.8 ± 1.4
82.8 ± 1.0
79.0 ± 1.4
77.4 ± 1.5
77.2 ± 2.1
77.9 ± 1.9
80.0 ± 0.8
82.3 ± 0.5∗
81.2 ± 0.6
81.1 ± 0.6
81.1 ± 0.7
81.1 ± 1.3
84.3 ± 1.4
83.7 ± 0.5
80.4 ± 0.6
83.9 ± 0.7
83.5 ± 0.8
82.9 ± 0.7
87.6 ± 0.8
87.8 ± 0.4
87.0 ± 0.5
87.5 ± 0.5
87.7 ± 0.4
87.7 ± 0.7
CD
82.2 ± 0.6
83.1 ± 0.4
82.6 ± 0.8
81.8 ± 1.0
82.2 ± 0.7
83.8 ± 1.1
Table 2: Augmentation strategies improve AUROC on detecting most cardiac abnormalities in the
low-sample regime (N = 1000), and TaskAug is among the best-performing methods. Table shows mean and standard error of AUROC (best-performing method bolded, second best underlined,
statistically significant (p < 0.05) improvement over NoAugs marked ∗ ). The impact of augmentations is
task-dependent, with some tasks (such as RVH, MI) showing improved performance on average with almost
all strategies, and others (HYP) showing no improvement with any strategy. TaskAug is among the best
methods across tasks, and improves performance on tasks such as AFib where no other augmentations help.
Low CO
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
65.9
68.2
66.0
68.3
66.1
66.8
±
±
±
±
±
±
1.2
1.0
1.4
0.9
0.9
1.1
Dataset C
High PCWP: N = 1000
66.7
67.9
67.2
66.4
66.4
67.3
±
±
±
±
±
±
0.7
0.7
0.5
0.6
1.3
0.4
High PCWP: All Data
74.4
75.1
73.6
74.9
75.0
74.6
±
±
±
±
±
±
0.5
0.4
0.5
0.4
0.4
0.4
Table 3: Training with data augmentation improves AUROC on two hemodynamics inference tasks,
and TaskAug again is among the best-performing methods. Table shows mean and standard error
of AUROC (best-performing method bolded, second best underlined). All methods are comparable with
or improve on the no augmentation baseline for Low CO prediction, possibly because of the low prevalence
of the label (4%). The performance of methods on the High PCWP task is more variable across the two
sample sizes. TaskAug obtains improvements in all three settings considered.
prevalence of the positive label (4%). For inferring
high PCWP, at both low sample and higher samples, TaskAug obtains improvements in performance
(though not significant at the p < 0.05 level); however, other methods do not consistently improve on
the no augmentation baseline. Although improvements in AUROC are not statistically significant, we
observe significant improvements with TaskAug in
AUPRC for low CO detection (see Appendix C, Table 6). Again, we see that the benefit of augmentation varies with the task, prevalance, and dataset
size, and that TaskAug is better than or competitive
with other strategies.
mentations on a per-task basis, learns more appropriate augmentation strategies. This hypothesis is supported by the fact that with MI (a more fine-grained
label than HYP, CD, and STTC) we observe improvements over the NoAugs baseline (clearly seen in the
N = 500 regime, Appendix C, Table 11).
Performance improvements at higher samples are
lower, as seen in the results in Appendix C. Augmentations do not worsen performance however, and
some tasks (STTC, CD) benefit a small amount,
∼ +1% AUROC.
Hemodynamics tasks. Table 3 presents results
for performance on the more challenging hemodynamics prediction tasks. All methods are comparable with or improve on the no augmentation baseline for low CO prediction, likely because of the low
8
Data Augmentation for Electrocardiograms
Stage 1
Probability of Operation Selection
Stage 2
0.4
Stage 1 Temporal Warp Strength
1.1
0.3
1.0
0.2
0.9
0.1
0.0
1.2
warp wander noise mask disp scale no-op
warp wander noise mask disp scale no-op
0.8
(a) Operation selection probabilities
Class 0 (No AFib)
Class 1 (AFib)
(b) Warp strengths
Figure 3: The TaskAug policy for Atrial Fibrillation detection. We focus on the probability of selecting
each transformation in both augmentation stages (left) and the optimized temporal warp strengths in
the first stage (right). We show the mean/standard error of these optimized policy parameters over 15
runs. Given the characteristic features of AFib (e.g., irregular R-R interval), Time Masking is likely to
be label preserving and therefore it is sensible that it has a high probability of selection. The temporal
warp strength for positive samples is higher than that for negative samples, which makes sense since time
warping a negative sample too strongly could change its label.
0.4
Stage 1
Probability of Operation Selection
Stage 2
0.3
0.6
0.5
0.2
0.4
0.1
0.0
Stage 2 Magnitude Scale Strength
warp wander noise mask disp scale no-op
warp wander noise mask disp scale no-op
(a) Operation selection probabilities
0.3
Class 0 (Normal PCWP) Class 1 (High PCWP)
(b) Magnitude scale strength
Figure 4: The TaskAug policy for detecting elevated Pulmonary Capillary Wedge Pressure. We focus
on the probability of selecting each transformation in both augmentation stages (left) and the optimized
magnitude scaling strengths in the second stage (right). We show the mean/standard error of these
optimized policy parameters over 15 runs. There exists little domain knowledge about what features
in the ECG may encode elevated PCWP, so examining the learned augmentations here could provide
hypotheses of invariances in the data. Of interest is that the positive class is augmented with stronger
magnitude scaling than the negative class, suggesting that scaling negative examples could affect their
labels.
a sensible choice. Considering the learned time warp
strength in Figure 3(b), we observe that signals labelled negative for AFib are warped less strongly than
those with AFib, again sensible since time warping
may affect the label of a signal and introduce AFib
in a signal where it was not originally present.
5.2.2. Analyzing learned policies
We analyze the learned policies for three of the predictive tasks: AFib, PCWP, and RVH (appendix).
AFib, Figure 3. We see that time mask has a high
probability of selection (Figure 3(a)). Since AFib is
characterized in the ECG by an irregular R peak- PCWP, Figure 4. We have limited domain unR peak interval (Couceiro et al., 2008), which is of- derstanding of what augmentations may be label preten present regardless of which section of ECG is se- serving and help model performance, since detecting
lected, time masking is likely label preserving, and is high PCWP from ECGs is not something clinicians
9
Data Augmentation for Electrocardiograms
MI
84
AFib
84
82
AUROC
80
80
78
No Augs
Init Augs Optimized Augs
78
78
76
76
80
80
78
76
76
No Augs
Init Augs Optimized Augs
No Augs TaskAug Ablation TaskAug
(Not Class-Specific)
Figure 5: Optimizing the TaskAug policy param-
MI
82
82
AUROC
AUROC
82
84
AUROC
AFib
84
No Augs TaskAug Ablation TaskAug
(Not Class-Specific)
Figure 6: Class-specific magnitude parameters in
eters results in performance improvements. We show the mean/standard error of
AUROC over 15 runs for AFib and over 5 runs
for MI. Without optimizing policy parameters
(InitAug), performance is comparable to not
using augmentations at all, indicating the importance of learning the policy parameters.
TaskAug lead to improvements in performance. We show the mean/standard error
of AUROC over 15 runs for AFib and over 5
runs for MI. This is particularly true for tasks
such as AFib where some operations may not
be label preserving.
Appendix C, we study this at different dataset sizes
and find that performance is improved by optimization at each size.
are typically able to do (Schlesinger et al., 2021). Analyzing the augmentations could provide hypotheses
about what features in the data encode the class label. Noise, displacement, and baseline wander all
obtain higher weight in the first stage, and scaling
obtains higher weight in the second stage. The high
weight assigned to noise could be to help the model
build invariance to it, and not use it as a predictive aspect of the signal. Studying the magnitude
scaling in Figure 4(b), we see positive examples are
scaled significantly more than negative examples. It
is possible that negative examples are more sensitive
to scale, and scaling them pushes them into positive
example space. The positive examples may have more
variance in scaling, and thus scaling them further has
less of an effect.
How much do class-specific magnitudes help?
TaskAug instantiates magnitude parameters for the
augmentation operations on a per-class basis, as described in Section 4.2.1, allowing positive and negative examples to be augmented differently. We examine this further, considering the AFib and MI detection tasks and N = 1000. We compare performance
using class-specific magnitude parameters (the positive and negative examples have independent augmentation magnitudes µ1 and µ0 ) vs. using global
magnitude parameters (the positive and negative examples are forced to have the same augmentation
magnitude: µ = µ0 = µ1 ). Results are shown in
Figure 6. We observe noticeable improvements in
5.2.3. Ablation Studies
performance with class-specific magnitude parameHow much does optimizing augmentations ters, demonstrating the importance of independently
help? Our results show that TaskAug offers im- specifying magnitudes for the two classes. In Approvements in performance. In Figure 5, we exam- pendix C, we study this at different dataset sizes and
ine how the actual optimization of the augmentation find that performance is improved at each size.
policy parameters (operation selection probabilities
and magnitudes, Section 4.2.1) affects performance,
5.2.4. Summary and best practices
considering the AFib and MI detection tasks and
N = 1000. We compare the performance of optimiz- • Training with data augmentations does not always
ing the policy parameters vs. keeping them fixed at
improve model performance, and may even hurt it.
The impact of augmentation depends on nature of
their initialized values and training. We observe imthe task, positive class prevalence, and dataset size.
provements in performance through the optimization
process, suggesting that it is not only the range of • Augmentations are most often useful in the lowsample regime. Where the prevalence is particuaugmentations that leads to improved performance,
larly low (see results for RVH detection) various
but also the optimization of the policy parameters. In
10
Data Augmentation for Electrocardiograms
augmentation strategies improve performance, perhaps by functioning as a form of oversampling.
• Data augmentations do not always improve performance at high sample sizes, but do not hurt it.
• TaskAug, our proposed augmentation strategy, is
the most effective method on average, and could
therefore be the first augmentation strategy one
tries on a new ECG prediction problem. TaskAug
defines a flexible augmentation policy that is optimized on a task-dependent basis, which directly
contributes to its effectiveness.
• TaskAug also offers insights as to what augmentations are most effective for a given problem, which
could be useful in novel prediction tasks (e.g.,
hemodynamics inference) to suggest what aspects
of the ECG determine the class label.
6. Conclusion
In this work, we studied the use of data augmentation for prediction problems from 12-lead electrocardiograms (ECGs). We outlined TaskAug, a new,
learnable data-augmentation strategy for ECGs, and
conducted an empirical study of this method and several existing augmentation strategies.
In our experimental evaluation on three ECG
datasets and eight distinct predictive tasks, we find
that data augmentation is not always helpful for
ECG prediction problems, and for some tasks may
worsen performance. Augmentations can be most
helpful in the low-sample regime, and specifically
when the prevalence of the positive class is low. Our
proposed learnable augmentation strategy, TaskAug,
was among the strongest performing methods in all
tasks. TaskAug augmentation policies are additionally interpretable, providing insight as to what transformations are most important for different problems. Future work could consider applying TaskAug
to other settings (e.g., multiview contrastive learning)
and modalities (e.g., EEGs) where flexible augmentation policies may be useful and could be interpreted
to provide scientific insight.
11
Data Augmentation for Electrocardiograms
Institutional Review Board (IRB)
Ricardo Couceiro, Paulo Carvalho, Jorge Henriques,
Manuel Antunes, Matthew Harris, and Jörg Habetha. Detection of atrial fibrillation using modelbased ecg analysis. In 2008 19th International Conference on Pattern Recognition, pages 1–5. IEEE,
2008.
This study was approved by the Institutional Review
Board (IRB) at Massachusetts General Hospital (protocol 2020P000132).
References
Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:
Learning augmentation strategies from data. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–
123, 2019.
J Bajorat, R Hofmockel, DA Vagts, M Janda, B Pohl,
C Beck, and G Noeldge-Schomburg. Comparison of
invasive and less-invasive techniques of cardiac output measurement under different haemodynamic
conditions in a pig model. European journal of
anaesthesiology, 23(1):23–30, 2006.
Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and
Quoc V Le. Randaugment: Practical automated
data augmentation with a reduced search space. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops,
pages 702–703, 2020.
Guha Balakrishnan, Amy Zhao, Mert Sabuncu, John
Guttag, and Adrian V. Dalca. An unsupervised
learning model for deformable medical image registration. CVPR: Computer Vision and Pattern
Recognition, pages 9252–9260, 2018.
Guha Balakrishnan, Amy Zhao, Mert Sabuncu, John
Guttag, and Adrian V. Dalca. Voxelmorph: A
learning framework for deformable medical image
registration. IEEE TMI: Transactions on Medical
Imaging, 38:1788–1800, 2019.
Nathaniel Diamant, Erik Reinertsen, Steven Song,
Aaron Aguirre, Collin Stultz, and Puneet Batra.
Patient contrastive learning: a performant, expressive, and practical approach to ecg modeling. 2021.
Kasun Bandara, Hansika Hewamalage, Yuan-Hao
Liu, Yanfei Kang, and Christoph Bergmeir. Improving the accuracy of global forecasting models using time series data augmentation. Pattern
Recognition, 120:108148, 2021.
Francis M Fesmire, Robert F Percy, Jim B Bardoner,
David R Wharton, and Frank B Calhoun. Usefulness of automated serial 12-lead ecg monitoring
during the initial emergency department evaluation
of patients with chest pain. Annals of emergency
medicine, 31(1):3–11, 1998.
Rohan Banerjee and Avik Ghose. Synthesis of realA. Goldberger, L. A. Amaral, L. Glass, Jefistic ecg waveforms using a composite generative
frey M. Hausdorff, P. Ivanov, R. Mark, J. Mietus,
adversarial network for classification of atrial fibG. Moody, C. Peng, and H. Stanley. PhysioBank,
rillation. In 2021 29th European Signal Processing
PhysioToolkit, and PhysioNet: components of a
Conference (EUSIPCO), pages 1145–1149, 2021.
new research resource for complex physiologic sigdoi: 10.23919/EUSIPCO54536.2021.9616079.
nals. Circulation, 101 23:E215–20, 2000.
Donald J Berndt and James Clifford. Using dynamic
Bryan Gopal, Ryan W. Han, Gautham Raghupathi,
time warping to find patterns in time series. In
Andrew Y. Ng, Geoffrey H. Tison, and Pranav RaKDD workshop, volume 10, pages 359–370. Seattle,
jpurkar. 3kg: Contrastive learning of 12-lead elecWA, USA:, 1994.
trocardiograms using physiologically-inspired augmentations. 2021.
Henry Blackburn, Ancel Keys, Ernst Simonson,
Pentti Rautaharju, and Sven Punsar. The electrocardiogram in population studies: a classification Awni Y Hannun, Pranav Rajpurkar, Masoumeh
Haghpanahi, Geoffrey H Tison, Codie Bourn,
system. Circulation, 21(6):1160–1175, 1960.
Mintu P Turakhia, and Andrew Y Ng.
Nitesh V Chawla, Kevin W Bowyer, Lawrence O
Cardiologist-level arrhythmia detection and
Hall, and W Philip Kegelmeyer. Smote: synthetic
classification in ambulatory electrocardiograms
minority over-sampling technique. Journal of artiusing a deep neural network. Nature medicine, 25
ficial intelligence research, 16:321–357, 2002.
(1):65–69, 2019.
12
Data Augmentation for Electrocardiograms
Faezeh Nejati Hatamian, Nishant Ravikumar, Su- Gen-Min Lin and Henry Horng-Shing Lu. A 12laiman Vesal, Felix P Kemeth, Matthias Struck,
lead ecg-based system with physiological parameand Andreas Maier. The effect of data augmentaters and machine learning to identify right ventriction on classification of atrial fibrillation in short
ular hypertrophy in young adults. IEEE Journal of
single-lead ecg signals using deep neural networks.
Translational Engineering in Health and Medicine,
In ICASSP 2020-2020 IEEE International Confer8:1–10, 2020. doi: 10.1109/JTEHM.2020.2996370.
ence on Acoustics, Speech and Signal Processing
Jonathan Lorraine, Paul Vicol, and David Duvenaud.
(ICASSP), pages 1264–1268. IEEE, 2020.
Optimizing millions of hyperparameters by implicit
differentiation. In International Conference on ArRyuichiro Hataya, Jan Zdenek, Kazuki Yoshitificial Intelligence and Statistics, pages 1540–1552.
zoe, and Hideki Nakayama. Meta approach to
PMLR, 2020.
data augmentation optimization. arXiv preprint
arXiv:2006.07965, 2020.
Chris J Maddison, Andriy Mnih, and Yee Whye Teh.
The concrete distribution: A continuous relaxKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
ation of discrete random variables. arXiv preprint
Sun. Deep residual learning for image recognition.
arXiv:1611.00712, 2016.
In Conference on Computer Vision and Pattern
Recognition, pages 770–778, 2016.
Temesgen Mehari and Nils Strodthoff.
Selfsupervised representation learning from 12-lead ecg
Bart Hiemstra, Geert Koster, Renske Wiersema,
data. arXiv preprint 2103.12676, 2021.
Yoran M Hummel, Pim van der Harst, Harold
Snieder, Ruben J Eck, Thomas Kaufmann, Meinard Müller. Dynamic time warping. Information
Thomas WL Scheeren, Anders Perner, et al. The
retrieval for music and motion, pages 69–84, 2007.
diagnostic accuracy of clinical examination for estimating cardiac index in critically ill patients: Daniel S Park, William Chan, Yu Zhang, ChungCheng Chiu, Barret Zoph, Ekin D Cubuk, and
the simple intensive care studies-i. Intensive care
Quoc V Le. Specaugment: A simple data augmedicine, 45(2):190–200, 2019.
mentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
J Hurst, C Rackley, E Sonnenblick, and N Wenger.
The heart, arteries and veins, volume 1. McGrawDaniel S Park, Yu Zhang, Chung-Cheng Chiu,
Hill, 1990.
Youzheng Chen, Bo Li, William Chan, Quoc V
Le, and Yonghui Wu. Specaugment on large scale
Brian Kenji Iwana and Seiichi Uchida. An empirical
datasets. In ICASSP 2020-2020 IEEE Internasurvey of data augmentation for time series clastional Conference on Acoustics, Speech and Sigsification with neural networks. Plos one, 16(7):
nal Processing (ICASSP), pages 6879–6883. IEEE,
e0254841, 2021a.
2020.
Brian Kenji Iwana and Seiichi Uchida. Time series
data augmentation for neural networks by time
warping with a discriminative teacher. In 2020
25th International Conference on Pattern Recognition (ICPR), pages 3558–3565. IEEE, 2021b.
Aniruddh Raghu, John Guttag, Katherine Young,
Eugene Pomerantsev, Adrian V Dalca, and
Collin M Stultz. Learning to predict with supporting evidence: applications to clinical risk prediction. In Proceedings of the Conference on Health,
Inference, and Learning, pages 95–104, 2021a.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical
reparameterization with gumbel-softmax. arXiv Aniruddh Raghu, Jonathan Lorraine, Simon Kornpreprint arXiv:1611.01144, 2016.
blith, Matthew McDermott, and David K Duvenaud. Meta-learning to improve pre-training. AdDani Kiyasseh, Tingting Zhu, and David A Clifton.
vances in Neural Information Processing Systems,
Clocs: Contrastive learning of cardiac signals
34, 2021b.
across space, time, and patients. In International
Conference on Machine Learning, pages 5606– Aniruddh Raghu, Maithra Raghu, Simon Kornblith,
5615. PMLR, 2021.
David Duvenaud, and Geoffrey Hinton. Teaching
13
Data Augmentation for Electrocardiograms
with commentaries. In International Conference Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song,
on Learning Representations, 2021c.
Jingkun Gao, Xue Wang, and Huan Xu. Time series data augmentation for deep learning: A survey.
Sushravya Raghunath, Alvaro E Ulloa Cerna,
arXiv preprint arXiv:2002.12478, 2020.
Linyuan Jing, Joshua Stough, Dustin N Hartzel,
Joseph B Leader, H Lester Kirchner, Martin C Clyde W Yancy, Mariell Jessup, Biykem Bozkurt,
Javed Butler, Donald E Casey, Mark H Drazner,
Stumpe, Ashraf Hafez, Arun Nemani, et al. PreGregg C Fonarow, Stephen A Geraci, Tamara Hordiction of mortality from 12-lead electrocardiogram
wich, James L Januzzi, et al. 2013 accf/aha guidevoltage data using a deep neural network. Nature
line for the management of heart failure: executive
medicine, 26(6):886–891, 2020.
summary: a report of the american college of cardiology foundation/american heart association task
Stephen M Salerno, Patrick C Alguire, and Herbert S
force on practice guidelines. Journal of the AmeriWaxman. Competency in interpretation of 12-lead
can College of Cardiology, 62(16):1495–1539, 2013.
electrocardiograms: a summary and appraisal of
published evidence. Annals of Internal Medicine,
138(9):751–760, 2003.
Daphne Schlesinger, Nathaniel Diamant, Aniruddh
Raghu, Erik Reinertsen, Katherine Young, Puneet
Batra, Eugene Pomerantsev, and Collin M. Stultz.
A deep learning model for inferring elevated pulmonary capillary wedge pressures from the 12-lead
electrocardiogram. 2021.
Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning.
Journal of Big Data, 6(1):1–48, 2019.
Slawek Smyl and Karthik Kuber. Data preprocessing and augmentation for multiple short time series
forecasting with recurrent neural networks. In 36th
International Symposium on Forecasting, 2016.
Peter Solin, Peter Bergin, Meroula Richardson,
David M Kaye, E Haydn Walters, and Matthew T
Naughton. Influence of pulmonary capillary wedge
pressure on central apnea in heart failure. Circulation, 99(12):1574–1579, 1999.
Terry T Um, Franz MJ Pfister, Daniel Pichler,
Satoshi Endo, Muriel Lang, Sandra Hirche, Urban
Fietzek, and Dana Kulić. Data augmentation of
wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference
on Multimodal Interaction, pages 216–220, 2017.
Patrick Wagner, Nils Strodthoff, Ralf-Dieter Bousseljot, Dieter Kreiseler, Fatima I Lunze, Wojciech
Samek, and Tobias Schaeffter. PTB-XL, a large
publicly available electrocardiography dataset. Scientific data, 7(1):1–15, 2020.
14
Data Augmentation for Electrocardiograms
Original signal example
Appendix A. Augmentation Methods
Signal generated with SMOTE
In this section, we provide further details on the different augmentation strategies explored (existing and
TaskAug), and visualize their operation.
A.1. Existing methods
Figures 7-10 present examples following augmentation using the existing methods. We show only one
lead for clarity; however, these operations will be applied to each lead.
Original signal
Figure 10: SMOTE.
policy and an example of applying the different steps;
(2) A more detailed description of the nested optimization algorithm used to learn TaskAug parameters, including a full algorithm; and (3) mathematical
descriptions of the operations used in TaskAug in our
experiments and a visualization of their effect on an
ECG signal.
After time masking
A.2.1. Structure of policy
Figure 7: Time Masking.
Original signal
Mathematical definition. As described in Section 4.2.1, the TaskAug policy is defined following
Hataya et al. (2020). At each augmentation stage
k ∈ {1, . . . , K} we have a set of operation selection
P (k)
parameters π (k) ∈ [0, 1]M , where i πi = 1 ∀k.
Each vector π (k) parameterizes a categorical distri(k)
bution such that each entry πi represents the probability of selecting operation i at augmentation stage
k. We obtain a reparameterizable sample from this
categorical distribution (using the Gumbel-Softmax
trick, (Jang et al., 2016; Maddison et al., 2016)) at
each stage to select the operation to use, as follows:
After SpecAugment
Figure 8: SpecAugment.
Original signal
u ∼ Categorical(π (k) ) # Note that u ∈ RM (5)
i = arg max u
(6)
ui
Ai (x, y; µ0 , µ1 ).
(7)
x̃ =
stop grad(ui )
After DGW
Why the multiplicative factor? We use the mului
to allow gradient flow to
tiplicative factor stop grad(u
i)
the operation selection parameters π. If we just selected i = arg max u and had no scaling in Eqn 7,
then there would be no gradient flow to π, since the
arg max operation is not differentiable.
The denominator of this scaling factor is necessary
because ui , obtained from the reparameterized sample from the categorical distribution, is not one-hot.
The resulting fraction used as the scaling factor always has magnitude 1, since |stop grad(ui )| = |ui |.
Figure 9: Discriminative Guided Warping (DGW).
A.2. TaskAug
We provide more details about TaskAug: (1) further
information about the mathematical formalism of the
15
Data Augmentation for Electrocardiograms
When we take the gradient, we get:
re-express this second term as:
−1
2
∂ 2 LT
∂ LT
∂ θ̂
=−
×
T
∂φ
∂θ ∂θ
∂θ ∂φT
∂
1
ui
∂ui
=
,
∂π stop grad(ui )
stop grad(ui ) ∂π
so the stop grad(ui ) acts as a scaling term.
,
(8)
θ̂(φ)
which is a product of an inverse Hessian and a matrix
of mixed partial derivatives. Adopting the algorithm
from Lorraine et al. (2020), we approximate this with
a truncated Neumann series with 1 term, and implicit
vector-Jacobian products.
Example application of TaskAug. Suppose we
have a one-stage TaskAug policy, K = 1, our augmentation set has two operations S = {A1 , A2 } which
are A1 = TimeMask(x, y; µ0 = 0.2, µ1 = 0.1) and
A2 = Noise(x, y; µ0 = 2.1, µ1 = 5.3), and the operation selection probability vector is π = [0.9, 0.1]
(that is, we select TimeMask with probability 0.9,
and noise with probability 0.1). Now consider applying TaskAug to a (data, label) pair (x, 1), i.e., the
label is 1. We follow these steps:
1. Obtain a reparameterizable sample u from
Categorical([0.9, 0.1]): let this be u = [0.75, 0.25].
2. Find i = arg max u; in this case, i = 1.
3. Select the operation A1 , i.e. TimeMask.
4. Compute the masking strength based on the label.
Recall this is defined as s = yµ1 + (1 − y)µ0 , so
s = 1 × 0.1 + (1 − 1) × 0.2 = 0.1.
5. Apply time-masking with strength 0.1 to x, generating x̂.
u1
to generate x̃.
6. Scale this by stop grad(u
1)
Training algorithm. Incorporating this gradient
estimator, the algorithm to jointly optimize base
model parameters and TaskAug policy parameters is
given in Algorithm 1, mirroring the approach used in
Raghu et al. (2021c).
Algorithm 1 Optimizing TaskAug parameters.
1: Initialize base model parameters θ and TaskAug
parameters φ
2: for t = 1, . . . , T do
3:
Compute training loss, LT (θ)
T
4:
Compute ∂L
∂θ
T
5:
Update θ ← θ − ηθ ∂L
∂θ
6:
if t % P == 0 then
7:
Set θ̂ = θ
8:
Compute the validation loss, LV (θ̂)
V
9:
Compute ∂L
∂ θ̂
A.2.2. Parameter optimization
10:
Approximate
∂ θ̂
∂φ
using Equation 8.
∂LV
∂ θ̂
V
Compute the derivative ∂L
∂φ = ∂ θ̂ × ∂φ
As detailed in the main text, there are many learn- 11:
using the previous two steps.
able parameters in TaskAug, and we use gradientV
12:
Update φ ← φ − ηφ ∂L
∂φ
based optimization to learn these jointly with the
13:
end
if
base model parameters. Here, we provide some more
details about the estimation of the gradient wrt the 14: end for
TaskAug parameters, and also include a full algorithm detailing the training procedure, Algorithm 1. Choice of P . The value of P influences how many
Estimating TaskAug parameter gradients. ‘inner’ gradient steps (to the base model) we perform
Let the base model parameters after P update steps before an ‘outer’ gradient step (to the TaskAug pabe denoted as θ̂(φ). We update the TaskAug policy rameters). There is a tradeoff here: if P is too small,
∂ θ̂
parameters to minimize the base model’s validation then applying the IFT to approximate ∂φ will result
in a poor approximation (Lorraine et al., 2020); if
loss LV , with the gradient of interest being:
P is too large, then updates to the policy parameters will have little effect on model parameters since
∂LV
∂ θ̂
∂LV
=
.
×
the base model has already reached minimal training
∂φ
∂φ
∂ θ̂
loss (and may start to overfit). In our experiments,
The first term on the RHS can be found exactly us- we find that P > 5 suffered from this second probing standard backpropagation. To compute the sec- lem, and P = 1 was sometimes unstable due to the
ond term, we re-express it using the implicit function first problem. In general, P = 1 worked well at small
theorem (IFT) as in Lorraine et al. (2020). Using sample sizes (N = 1000), and P = 5 worked better
LT to denote the training loss, the IFT allows us to at N = 2500 and N = 5000.
16
Data Augmentation for Electrocardiograms
sigmoid(s) × Uniform(0.75, 1.25), where s is a
learnable strength parameter, initialized to 0.
• Temporal displacement. We shift the entire signal in time, padding with zeros where required.
Our implementation directly generates a displacement field (as with temporal warping) and uses
the spatial transformation from Balakrishnan et al.
(2018, 2019) to transform the signal. This allows the operation to be differentiable, and for us
to learn the displacement strength s. The displacement magnitude is a Uniform distribution on
[−100 × s2 , 100 × s2 ], with the strength being initialized to 0.5.
A.2.3. Augmentation operations
Figure 11 shows the different operations used in
TaskAug. We show only one lead for clarity; however, these operations will be applied to each lead.
We now provide more details on the implementation
of these operations in our experiments.
• TimeMask. As with the existing TimeMask
strategies, we randomly select a contiguous portion
of the signal to set to zero. We set 10% of the signal to zero in our implementation. This parameter
is not optimized.
• Gaussian Noise. IID Gaussian noise is added
to the signal. This is formed as follows. We first
compute the standard deviation of each lead of the
signal: let us denote this as σ. Then, the noise
added to each sample of the signal is expressed as:
ǫ = 0.25 × σ × sigmoid(s) × N (0, 1), where s is
the learnable strength parameter, initialized to 0.
The coefficient 0.25 was found by visual inspection
of some augmented examples, and observing that
this allowed flexible augmentations to be generated
without overwhelming the signal with noise.
• Temporal warping. The signal is warped with
a random, diffeomorphic temporal transformation.
To form this, we sample from a Gaussian with zero
mean, and a fixed variance 100 × s2 , where s is the
learnable strength parameter (initialized to 1), at
each temporal location, to generate a length T dimensional random velocity field. This velocity field
is then integrated (following the scaling and squaring numerical integration routine used by Balakrishnan et al. (2018, 2019)). This resulting displacement field is then smoothed with a Gaussian filter
to generate the smoothed temporal displacement
field. This field represents the number of samples
each point in the original signal is translated in
time. The field is then used to transform the signal, translating each channel in the same way (i.e.,
the field is the same across channels).
• Baseline wander. We firstly form a wander amplitude by computing: A = 0.25 × sigmoid(s) ×
Uniform(0, 1), where again s is a learnable strength
parameter. Then, we compute the frequency and
phase of the sinusoidal offset. The frequency is
, based on the
computed as: f = 20×Uniform(0,1)+10
60
approximate number of breaths per minute for an
adult. The phase is: φ = 2π×Uniform(0, 1). Then,
the sinusoidal offset is computed as: A sin(f t + φ).
• Magnitude scaling.
We scale the entire signal by a random magnitude given by
17
Data Augmentation for Electrocardiograms
Original signal
After TaskAug Time Mask
After TaskAug Gaussian Noise
2
2
2
1
1
1
0
0
0
1
1
1
After TaskAug Temporal Warp
2
After TaskAug Baseline Wander
2
3
After TaskAug Magnitude Scale
2
1
1
1
0
0
0
1
1
1
After TaskAug Temporal Displacement
2
1
0
1
Figure 11: Examples of the different operations used in TaskAug.
18
Data Augmentation for Electrocardiograms
Appendix B. Dataset Details
fibrillation with moderate ventricular response”, “fibrillation/flutter”, “atrial fibrillation with controlled
ventricular response”, “afib”, “atrial fib”, “afibrillation”, “atrial fibrillation”, “atrialfibrillation”.
Preprocessing. ECGs were sampled at 250 Hz for
10 seconds, resulting in a 2500 × 12 tensor for all 12
leads, per-ECG. We normalized the signals by dividing by 1000. Other forms of normalization for this
dataset (e.g., z-scoring) resulted in some abnormally
large/small values.
We provide more details about the three datasets.
B.2. Dataset B
The four labels are obtained by aggregating relevant
sets of diagnostic statements – we refer the reader to
the PTB-XL paper (Wagner et al., 2020) for further
details. Of relevance here is that certain labels, such
as MI, contain a small number of distinct diagnostic statements (3), potentially suggesting why many
augmentation strategies can help – it is a fine-grained
task. Others (such as CD) are much broader, covering many more diagnostic statements.
Preprocessing. ECGs in the dataset are sampled
at 500 Hz for 10 seconds; we downsample these by
a factor of 2 for consistency with Dataset A and C,
resulting in a 2500 × 12 tensor for all 12 leads, perECG. Normalization involved z-scoring, following the
code provided with the dataset.
B.1. Dataset A
B.3. Dataset C
The hemodynamics prediction cohort consists of patients who had an ECG and right heart catheterization procedure on the same day. The catheterization
procedure measures hemodynamics variables including the pulmonary capillary wedge pressure (PCWP)
and cardiac output (CO), and these are used to form
the prediction targets. We consider inferring abnormally low Cardiac Output (less than 2.5 L/min), and
abnormally high Pulmonary Capillary Wedge Pressure (greater than 20 mmHg).
The labels for RVH and AFib were assigned to each
example based on whether relevant diagnostic statements were present in either a clinician’s read of the
ECG, or a machine read of the ECG.
For RVH, there were six diagnostic statements that
led to a positive label being assigned: “right ventricular hypertrophy”, “biventricular hypertrophy”,
“combined ventricular hypertrophy”, “right ventricular enlargement”, “rightventricular hypertrophy”,
“biventriclar hypertrophy”.
Preprocessing. ECGs were sampled at 250 Hz for
10 seconds, resulting in a 2500 × 12 tensor for all 12
leads, per-ECG. We normalized the signals by dividing by 1000. Other forms of normalization for this
dataset (e.g., z-scoring) resulted in some abnormally
large/small values, so we opted for the division-based
normalization.
For AFib, there were nine such statements: “atrial
fibrillation with rapid ventricular response”, “atrial
19
Data Augmentation for Electrocardiograms
Appendix C. Experiments
In this section, we provide further experimental details. We first provide implementation details, and
then outline additional experimental results including: Results for AUPRC in the low-sample (N =
1000) regime, performance on Datasets A and B
in the high-sample regime, performance on Dataset
B in an additional low sample regime (N = 500
data points), interpretation of the TaskAug policy
for RVH, a study of the impact of optimizing policy parameters across different sample size regimes,
and a study of the impact of class-specific magnitudes
across different sample size regimes.
C.1. Implementation details
Network architecture. In all experiments, we use
RVH
AFib
a 1D CNN based on a ResNet-18 (He et al., 2016) arNoAugs
7.4 ± 1.3
21.2 ± 2.0
chitecture. This model has convolutions with a kernel
TaskAug
10.8 ± 0.8∗ 27.3 ± 1.8∗
size of 15, and stride 2 (informed by the temporal winSMOTE
9.7 ± 1.2
21.0 ± 2.2
dow we want the convolutions to operate over). The
DGW
7.1 ± 0.9
19.4 ± 2.3
blocks in the ResNet architecture have convolutional
SpecAug
10.6 ± 1.2
21.1 ± 2.0
layers with 32, 64, 128, and 256 channels respectively.
TimeMask 10.1 ± 1.5
20.3 ± 2.3
The output after the final block is average pooled in
the temporal dimension, and then a linear layer is apTable 4: Mean and standard error of AUPRC for various
plied to predict the probability of the positive class.
data augmentation strategies when detecting
cardiac abnormalities on Dataset A. We consider a low-sample regime with a development
set of 1000 data points. The best-performing
method is bolded, and the second best is underlined, and ∗ indicates statistically significant
improvement at the p < 0.05 level. TaskAug is
the only method to obtain significant improvements in performance on both tasks.
Optimization settings. As discussed, we used
Adam with a learning rate of 1e-3 for all methods,
given that this resulted in stable training across all
settings. When optimizing the TaskAug policy parameters, we used RMSprop with a learning rate of
1e-2, following Lorraine et al. (2020).
Computational information. All models and
training were implemented in PyTorch and run on
a single NVIDIA V100 GPU.
C.2. Additional results
AUPRC results at 1000 samples. As discussed
in Section 5.2, the improvements in AUROC are not
always statistically significant. Given that some of
the labels are very low prevalence (RVH: 1%, AFib:
5%, low CO: 4%), we evaluate the AUPRC in the
low-sample regime, which provides additional information about model performance. Results are shown
in Tables 4, 5, and 6. We observe that for the low
prevalence RVH, AFib, and Low CO tasks, TaskAug
obtains statistically significant improvements in performance. On Dataset A tasks (RVH and AFib), it
is the only method to do so.
20
Data Augmentation for Electrocardiograms
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
MI
HYP
STTC
59.2±2.1
63.1±1.7
62.0±1.6
61.1±1.2
61.7±1.6
60.3±1.3
53.1±1.7
55.2±0.9
41.2±2.9
53.9±1.6
54.5±1.5
52.8±1.8
66.9±2.5
68.7±1.3
65.9±1.0
67.9±1.1
68.8±1.5
68.8±1.2
CD
67.3±1.1
66.8±1.2
62.7±1.1
64.7±2.6
65.8±1.4
70.1±1.3
Table 5: Mean and standard error of AUPRC for various data augmentation strategies on detecting cardiac abnormalities on Dataset B. We consider a low-sample regime with a development set of 1000 data points. The
best-performing method is bolded, and the second best is underlined, and ∗ indicates statistically significant
improvement at the p < 0.05 level.
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
Low CO
High PCWP:
N = 1000
High PCWP:
All Data
7.2 ± 0.4
8.8 ± 0.6∗
8.8 ± 0.6∗
8.1 ± 0.7
7.8 ± 0.4
8.0 ± 0.5
42.5 ± 0.8
43.5 ± 0.9
41.9 ± 0.7
41.2 ± 0.7
42.3 ± 1.1
42.4 ± 0.7
49.7 ± 0.8
50.8 ± 0.8
46.9 ± 0.7
49.7 ± 1.0
50.3 ± 0.8
50.1 ± 0.9
Table 6: Mean and standard error of AUPRC for various data augmentation strategies for the hemodynamics inference task in Dataset C. We consider a low-sample regime with a development set of 1000 data points.
The best-performing method is bolded, and the second best is underlined, and ∗ indicates statistically
significant improvement at the p < 0.05 level. TaskAug is the one of only two methods to obtain significant
improvements in performance on the low CO detection task.
21
Data Augmentation for Electrocardiograms
Results at higher sample regimes. Tables 7-10
show AUROC for the different augmentation methods on the tasks from Datasets A and B. We observe that augmentations are less effective at higher
samples. Particularly when the development set
sizes are 2500 and 5000 datapoints, we observe that
the improvement with using augmentations (over the
NoAugs baseline) with any of the methods is quite
small, and nearly always less than 1% AUROC. This
suggests that in general, augmentations are less useful at these higher data regimes.
RVH
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
86.1±0.9
86.9±0.9
85.5±1.3
84.8±1.3
83.3±1.8
85.8±1.1
AFib
89.0 ± 0.4
89.1 ± 0.4
89.1 ± 0.5
88.4 ± 0.5
89.1 ± 0.3
88.2 ± 0.4
Table 7: Mean and standard error of AUROC for augmentation methods on Dataset A tasks with a
development set of 2500 data points. The best
performing method is bolded, and the second
best is underlined.
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
RVH
AFib
90.6±0.6
90.6±0.4
89.8±0.6
90.8±0.5
90.5±0.8
89.4±0.7
92.6±0.2
92.8±0.1
92.6±0.2
92.5±0.2
92.7±0.1
92.6±0.2
Table 8: Mean and standard error of AUROC for augmentation methods on Dataset A tasks with a
development set of 5000 data points. The best
performing method is bolded, and the second
best is underlined.
22
Data Augmentation for Electrocardiograms
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
MI
HYP
STTC
84.5±0.5
86.1±0.5
84.7±0.7
84.1±0.5
84.6±0.8
85.7±0.4
86.4±0.4
86.2±0.4
81.9±1.3
85.9±0.6
86.2±0.6
86.6±0.3
89.7±0.3
89.7±0.3
88.7±0.4
89.5±0.3
90.2±0.3
90.1±0.1
CD
85.8±0.3
86.6±0.4
85.5±0.6
86.2±0.3
86.8±0.6
87.0±0.7
Table 9: Mean and standard error of AUROC for augmentation methods on Dataset B tasks with a development
set of 2500 data points. The best performing method is bolded, and the second best is underlined.
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
MI
HYP
STTC
89.4±0.3
89.4±0.3
86.6±0.7
88.6±0.3
89.5±0.2
89.3±0.3
88.2±0.2
88.3±0.2
86.7±0.4
88.0±0.2
88.4±0.4
88.6±0.2
91.0±0.3
91.6±0.2
90.6±0.3
91.3±0.1
91.6±0.2
91.6±0.2
CD
89.3±0.4
90.0±0.2
88.0±0.3
89.3±0.2
89.9±0.2
89.8±0.2
Table 10: Mean and standard error of AUROC for augmentation methods on Dataset B tasks with a development
set of 5000 data points. The best performing method is bolded, and the second best is underlined.
23
Data Augmentation for Electrocardiograms
Results on Dataset B at N = 500. Table 11
shows AUROC for the different augmentation methods in an additional low sample regime, with N =
500. We see that the maximum improvement over
the NoAugs baseline by any augmentation strategy
is greater in this regime than it was at N = 1000 (see
Table 2). Given that the prevalence of these tasks is
relatively high, we see more significant performance
improvements in the N = 500 regime.
24
Data Augmentation for Electrocardiograms
NoAugs
TaskAug
SMOTE
DGW
SpecAug
TimeMask
MI
HYP
STTC
74.4 ± 0.9
78.4 ± 0.5
75.7 ± 1.2
78.2 ± 0.6
77.8 ± 0.7
77.8 ± 1.0
81.9 ± 0.8
81.5 ± 1.2
79.2 ± 1.5
78.7 ± 1.2
81.0 ± 0.6
80.9 ± 1.3
85.2 ± 0.5
86.2 ± 0.4
85.5 ± 0.3
82.0 ± 1.3
86.3 ± 0.4
86.6 ± 0.5
CD
78.9 ± 1.2
80.7 ± 0.6
78.6 ± 1.5
79.0 ± 0.9
79.3 ± 1.1
80.3 ± 0.8
Table 11: Mean and standard error of AUROC for augmentation methods on Dataset B tasks with a development
set of 500 data points. The best performing method is bolded, and the second best is underlined.
25
Data Augmentation for Electrocardiograms
Interpreting the RVH policy. We visualize the
TaskAug policy for RVH in Figure 12. We observe
high probability assigned to selecting two temporal
operations in stage 1, namely masking and displacement. Relative magnitudes of different portions of
the ECG affect the RVH label, so temporal operations having higher probability of selection is sensible since they are more likely to be label preserving
than operations that change the relative magnitudes
of different parts of the ECG. We examine the learned
strengths for the displacement operation in Stage 1,
Figure 12(b), and we see that there is little differentiation on a per-class basis. This is sensible, since we
do not expect displacement of the signal in time to
affect the RVH label for differently for the positive
and negative classes.
26
Data Augmentation for Electrocardiograms
Probability of Operation Selection
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
0.00
Stage 1
Stage 2
0.65
Stage 1 Displacement Strength
0.60
0.55
warp wander noise mask disp
scale no-op
warp wander noise mask disp
(a) Operation selection probabilities
scale no-op
0.50
Class 0 (No RVH)
Class 1 (RVH)
(b) Displacement strengths
Figure 12: TaskAug policy for detecting Right Ventricular Hypertrophy. The learned TaskAug policy:
probability of selecting each transformation in both augmentation stages and the optimized displacement
strengths in the first stage. We show the mean/standard error of the learned parameter values over 15
runs. Temporal operations (masking and displacement) have high probability of selection in Stage 1,
which is sensible since these operations are likely to be label preserving (RVH is typically detected based
on relative magnitudes of portions of beats in the ECG). We see that both positive and negative classes
have similar optimized displacement augmentation strengths – we do not expect displacement to impact
the class label differently for the two classes, so this is sensible.
27
Data Augmentation for Electrocardiograms
Further study on the impact of optimizing
augmentations. As shown in the main text, Figure 5, optimizing the policy parameters improves performance over keeping them fixed at their initial values. In Figure 13, we study this effect across different
dataset sizes and find that the optimization has the
most impact in the low sample regime, but still results in improvements even at higher samples. This
could be due to the fact that at higher samples, augmentations boost performance less in general, so the
specific parameter settings in TaskAug also have less
impact.
Further study on the impact of class-specific
magnitudes. As shown in the main text, Figure
6, optimizing class-specific magnitudes improves over
learning one magnitude parameter for each class. Figure 14 studies this effect across different dataset sizes
and we see that the class-specific parameters improve
performance at all dataset sizes, but the improvement
is most clearly seen at low samples. Similarly with the
optimization of augmentation parameters, this could
be due to the fact that at higher samples, augmentations boost performance less in general, so the classspecific parameterization in TaskAug has less impact.
28
Data Augmentation for Electrocardiograms
AFib
92.5
MI
90
90.0
85
AUROC
AUROC
87.5
85.0
80
82.5
Init Augs
Optimized Augs
80.0
1000
2500
Number of datapoints
Init Augs
Optimized Augs
75
5000
5001000
2500
Number of datapoints
5000
Figure 13: Studying performance when we do not optimize the policy parameters in TaskAug. We
show the mean/standard error of AUROC over 15 runs for AFib and over 5 runs for MI. We see that
optimizing the policy parameters results in noticeable improvements in performance over keeping the
policy parameters at their initial values (InitAugs). However, the impact of optimizing the parameters
is reduced at larger dataset sizes, possibly due to the fact that augmentations are inherently less useful
at higher sample regimes.
90.0
87.5
85.0
82.5
80.0
77.5
75.0
MI
AUROC
AUROC
92.5
90.0
87.5
85.0
82.5
80.0
77.5
AFib
Not Class-Specific Mags
Class-Specific Mags
1000
2500
Number of datapoints
5000
Not Class-Specific Mags
Class-Specific Mags
5001000
2500
Number of datapoints
5000
Figure 14: Studying performance when we do not have class-specific magnitude parameters in
TaskAug. We show the mean/standard error of AUROC over 15 runs for AFib and over 5 runs for
MI. Class-specific magnitude parameters improve performance most in the low sample regime. At higher
samples, this impact is reduced, possibly due to the fact that augmentations are inherently less useful at
higher sample regimes.
29