
Data Augmentation for Electrocardiograms

2022


Proceedings of Machine Learning Research LEAVE UNSET:1–28, 2022
Conference on Health, Inference, and Learning (CHIL) 2022
arXiv:2204.04360v1 [cs.LG] 9 Apr 2022
© 2022 A. Raghu, D. Shanmugam, E. Pomerantsev, J. Guttag & C.M. Stultz

Aniruddh Raghu ([email protected]), Massachusetts Institute of Technology, USA
Divya Shanmugam ([email protected]), Massachusetts Institute of Technology, USA
Eugene Pomerantsev ([email protected]), Massachusetts General Hospital, USA
John Guttag ([email protected]), Massachusetts Institute of Technology, USA
Collin M. Stultz ([email protected]), Massachusetts Institute of Technology, USA

Abstract

Neural network models have demonstrated impressive performance in predicting pathologies and outcomes from the 12-lead electrocardiogram (ECG). However, these models often need to be trained with large, labelled datasets, which are not available for many predictive tasks of interest. In this work, we perform an empirical study examining whether training time data augmentation methods can be used to improve performance on such data-scarce ECG prediction problems. We investigate how data augmentation strategies impact model performance when detecting cardiac abnormalities from the ECG. Motivated by our finding that the effectiveness of existing augmentation strategies is highly task-dependent, we introduce a new method, TaskAug, which defines a flexible augmentation policy that is optimized on a per-task basis. We outline an efficient learning algorithm to do so that leverages recent work in nested optimization and implicit differentiation. In experiments, considering three datasets and eight predictive tasks, we find that TaskAug is competitive with or improves on prior work, and the learned policies shed light on what transformations are most effective for different tasks. We distill key insights from our experimental evaluation, generating a set of best practices for applying data augmentation to ECG prediction problems.

Data and Code Availability. We use three datasets: two are from Massachusetts General Hospital (MGH) and are not publicly available; the third is PTB-XL (Wagner et al., 2020), which is publicly available on the PhysioNet repository (Goldberger et al., 2000). Code implementing our method is available here: https://github.com/aniruddhraghu/ecg_aug.

1. Introduction

Electrocardiography is used widely in medicine as a non-invasive and relatively inexpensive method of measuring the electrical activity in an individual's heart. The output of electrocardiography, the electrocardiogram (ECG), is of great utility to clinicians in diagnosing and monitoring various cardiovascular conditions (Salerno et al., 2003; Fesmire et al., 1998; Blackburn et al., 1960). In recent years, there has been significant interest in automatically predicting cardiac abnormalities, diseases, and outcomes directly from ECGs using neural network models (Hannun et al., 2019; Raghunath et al., 2020; Gopal et al., 2021; Diamant et al., 2021; Kiyasseh et al., 2021; Raghu et al., 2021a). Although these works demonstrate impressive results, they often require large labelled datasets with paired ECGs and labels to train models. In certain situations, it is challenging to construct such datasets. For example, consider inferring abnormal central hemodynamics (e.g., cardiac output) from the ECG, which is important when monitoring patients with heart failure or pulmonary hypertension (Schlesinger et al., 2021).
Accurate hemodynamics labels are only obtainable through specialized invasive studies (Bajorat et al., 2006; Hiemstra et al., 2019), and hence it is difficult to obtain large datasets with paired ECGs and hemodynamics variables.

Data augmentation (Hataya et al., 2020; Wen et al., 2020; Shorten and Khoshgoftaar, 2019; Iwana and Uchida, 2021a; Cubuk et al., 2019, 2020) during training is a useful strategy to improve the predictive performance of models in data-scarce regimes. However, there exists limited work studying data augmentation for ECGs. A key problem with applying standard data augmentations is that fine-grained information within ECGs, such as the relative amplitudes of portions of beats, carries predictive signal: augmentations may worsen performance if such predictive signal is destroyed. Furthermore, the effectiveness of data augmentations with ECGs varies on a task-specific basis: applying the same augmentation to two different tasks could help performance in one case and hurt performance in another (Figure 1).

Figure 1: The effect of data augmentation on ECG prediction tasks is task-dependent. We examine the mean/standard error of AUROC over 5 runs when applying SpecAugment (Park et al., 2019), a data augmentation method, to two different ECG prediction tasks. We observe a performance improvement in one setting (left, Right Ventricular Hypertrophy) and a performance reduction in another (right, Atrial Fibrillation). [Figure: two bar charts of AUROC, with and without SpecAugment, for Task 1 (RVH detection) and Task 2 (AFib detection).]

In this work, we take steps towards addressing these issues. Our contributions are as follows:

• We propose TaskAug, a new task-dependent augmentation strategy. TaskAug defines a flexible augmentation policy that is optimized on a per-task basis. We outline an efficient learning algorithm to do so that leverages recent work in nested optimization and implicit differentiation (Lorraine et al., 2020).
• We conduct an empirical study of TaskAug and other augmentation strategies on ECG predictive problems. We consider three datasets and eight different predictive tasks, which cover different classes of cardiac abnormalities.
• We analyze the results from our evaluation, finding that many augmentation strategies do not work well across all tasks. Given its task-specific nature, TaskAug is competitive with or improves on other methods for the problems we examined.
• We study the learned TaskAug policies, finding that they offer insights as to what augmentations are most appropriate for different tasks.
• We provide a summary of findings and best practices to assist future studies exploring data augmentation for ECG tasks.

2. Related Work

Data augmentation for time-series. Prior research on time-series data augmentation includes: (1) large-scale surveys exploring the impact of augmentation on various downstream modalities (Iwana and Uchida, 2021a,b; Wen et al., 2020); and (2) specific methods for particular modalities, including speech signals (Park et al., 2019, 2020), wearable device signals (Um et al., 2017), and time series forecasting (Bandara et al., 2021; Smyl and Kuber, 2016). There is relatively little work exploring how augmentation can impact performance for ECG-based prediction tasks, with prior studies mostly restricted to considering single tasks (Hatamian et al., 2020; Banerjee and Ghose, 2021). In contrast, in this paper, we evaluate a set of data augmentation methods on many different predictive tasks, studying when and why augmentations may help. In addition, the data augmentation strategy proposed in this work, TaskAug, can be readily adapted to new predictive tasks, unlike in existing works where the methods may be designed for a very specific downstream task.
There also exists related work on using data augmentation for contrastive pre-training with ECGs (Gopal et al., 2021; Kiyasseh et al., 2021; Raghu et al., 2021b; Mehari and Strodthoff, 2021). These works are complementary to ours; we focus specifically on supervised learning (rather than contrastive pre-training), and we hypothesize that our proposed augmentation pipeline could be used in these prior methods for improved contrastive learning.

Designing and learning data augmentation policies. The structure of TaskAug, our proposed augmentation strategy, was inspired by related work on flexible data augmentation policies in computer vision (Cubuk et al., 2019, 2020; Hataya et al., 2020). We extend these ideas to ECG predictive tasks by (1) selecting appropriate transformations for ECG data, and (2) allowing for class-specific transformation strengths. Since such policies introduce many hyperparameters, we use a bi-level optimization algorithm to enable scalable policy learning (Lorraine et al., 2020; Raghu et al., 2021c).

3. Problem Setup and Notation

We focus on supervised binary classification problems from ECG data. Let x ∈ R^{12×T} refer to a 12-lead ECG of T samples and y ∈ {0, 1} refer to a binary target. We let D = {(x_n, y_n)}_{n=1}^{N} refer to a dataset of N ECG-label pairs. Let f(x; θ) → ŷ be a neural network model with parameters θ that outputs a predicted label ŷ given x as input. Network parameters are optimized to minimize the average binary cross entropy loss L_BCE on the training dataset D^(train).

We restrict our study to single-label binary classification problems in this work in order to study the effect of data augmentation on a per-task basis. One can extend this to multilabel binary classification by letting y be a vector of several different binary labels and training the network to produce a vector of predictions.

Training with Data Augmentation. Let A(x, y; φ) → x̃ refer to a data augmentation function with hyperparameters φ that takes the input ECG x and its label y and outputs an augmented version x̃. Note that this formulation implicitly assumes that the augmentation is label preserving, since it does not also change the label y. Where relevant, the augmentation hyperparameters φ may control the strength/probability of applying an augmentation. The process of training with data augmentation¹ amounts to:

1. Sample a data point and label pair from the training set: (x, y) ∼ D^(train).
2. Apply the augmentation A : x ↦ x̃ to transform the original input x to an augmented version x̃.
3. Use the pair (x̃, y) in training.

¹ For the SMOTE baseline this process is slightly different; details are in Section 4.
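To make this procedure concrete, the following is a minimal PyTorch-style sketch of steps 1–3. It is our illustration, not the released implementation: the model, the augmentation function, and the data loader are assumed to be defined elsewhere.

    import torch.nn.functional as F

    def train_epoch(model, augment, loader, optimizer):
        """One epoch of training with a label-preserving augmentation.

        `augment` maps a batch of ECGs x (shape [B, 12, T]) and labels y
        to an augmented batch x_tilde; the labels pass through unchanged.
        """
        model.train()
        for x, y in loader:                      # step 1: sample (x, y) ~ D_train
            x_tilde = augment(x, y)              # step 2: x -> x_tilde
            logits = model(x_tilde).squeeze(-1)  # step 3: train on (x_tilde, y)
            loss = F.binary_cross_entropy_with_logits(logits, y.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()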
4. Data Augmentation Methods

We now describe the data augmentation methods considered in our experiments. We also present our new, learnable data augmentation method that can be used to find task-specific augmentation policies, and an algorithm to optimize its parameters.

4.1. Existing Data Augmentation Methods

We evaluate the following set of existing data augmentation strategies, which includes operations in the signal (time-domain) space, frequency space, and interpolated signal space, providing good coverage of the possible space of augmentations.

Time Masking. This is a commonly used method in time-series and ECG data augmentation work (Iwana and Uchida, 2021a; Gopal et al., 2021). We mask out (set to zero) a contiguous fraction w ∈ [0, 1] of the original signal of length T: we choose a random starting sample t_s and set all samples in [t_s, t_s + wT] to 0.

SpecAugment. A highly popular method for augmenting speech signals (Park et al., 2019, 2020). We follow the approach from Kiyasseh et al. (2021), and apply masking (setting components to zero) in the time and frequency domains as follows. We take the Short-Time Fourier Transform (STFT) of the input signal, and independently mask a fraction w of the temporal bins and frequency bins (this involves setting the complex valued entries in these bins to 0 + 0j). The inverse STFT is then used to map the signal back to the time domain.
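As an illustration, here is a NumPy/SciPy sketch of the two masking operations just described. The STFT window length and the choice of masking randomly-selected (rather than contiguous) bins are assumptions of this sketch, not values stated in the paper:

    import numpy as np
    from scipy.signal import stft, istft

    def time_mask(x, w, rng):
        """Zero out a random contiguous fraction w of a [12, T] ECG."""
        T = x.shape[-1]
        width = int(w * T)
        ts = rng.integers(0, max(T - width, 1))  # random starting sample
        out = x.copy()
        out[..., ts:ts + width] = 0.0
        return out

    def spec_augment(x, w, rng):
        """Mask a fraction w of STFT time and frequency bins, then invert."""
        f, t, Z = stft(x, nperseg=256)           # nperseg is an assumption here
        n_freq, n_time = Z.shape[-2], Z.shape[-1]
        freq_idx = rng.choice(n_freq, size=int(w * n_freq), replace=False)
        time_idx = rng.choice(n_time, size=int(w * n_time), replace=False)
        Z[..., freq_idx, :] = 0 + 0j             # zero complex entries in masked bins
        Z[..., :, time_idx] = 0 + 0j
        _, x_rec = istft(Z, nperseg=256)
        return x_rec[..., :x.shape[-1]]          # trim to the original length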
Discriminative Guided Warping (DGW). Introduced in Iwana and Uchida (2021b), this method uses Dynamic Time Warping (DTW) (Müller, 2007; Berndt and Clifford, 1994) to warp a source ECG to match a representative reference signal that is dissimilar to examples from other classes.

SMOTE (Chawla et al., 2002). A commonly used oversampling strategy, the SMOTE algorithm generates new synthetic examples of the minority class by interpolating minority class samples. Given that many ECG prediction problems are characterized by significant class imbalance, oversampling algorithms are important methods to consider. In contrast to the other methods, the SMOTE algorithm generates an augmented dataset prior to any training, based on a predefined training set size, rather than augmenting examples at each training iteration (as presented in Section 3). We set this value to achieve a balanced number of the two classes.
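The paper does not state which SMOTE implementation it uses; the sketch below uses the imbalanced-learn package as one plausible choice, building the balanced dataset once before training:

    from imblearn.over_sampling import SMOTE

    def smote_balance(X, y, seed=0):
        """Build a class-balanced dataset once, before training.

        X: [N, 12, T] ECG array, y: [N] binary labels. SMOTE interpolates
        in feature space, so each ECG is flattened to a vector first.
        """
        n, leads, length = X.shape
        sm = SMOTE(sampling_strategy=1.0, random_state=seed)  # 1.0: balance classes
        X_res, y_res = sm.fit_resample(X.reshape(n, leads * length), y)
        return X_res.reshape(-1, leads, length), y_res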
4.2. TaskAug: A New Augmentation Policy

Motivation. The approaches mentioned so far are simple to implement and can be effective for various problems; however, they are fairly inflexible, given each individually uses only one or two fixed transformations. With ECGs, recall that it is unclear on a per-task basis which augmentations may help or worsen performance (Figure 1). Designing a more flexible augmentation strategy that is optimized on a per-task basis could help with this problem, and we now describe such an approach, TaskAug.

High-level structure. We define a set of operations S = {A_1, ..., A_M}, each of which is an augmentation function of the form A_i(x, y; µ_0, µ_1), where x is the input data point to the augmentation function, y is the label, and {µ_0, µ_1} represent the augmentation strengths for datapoints of class label 0 and class label 1 respectively. We separately parameterize the augmentation strengths for each class because transformations may corrupt predictive information in the signal for one class but not the other.

The overall augmentation policy consists of a set of K stages, where at each stage we: (1) sample an augmentation function A_i to apply; and (2) apply it to the input signal to that stage. This allows composing combinations of operations in a stochastic manner. A high-level schematic is shown in Figure 2.

Figure 2: Structure of TaskAug. Augmentations to apply are sampled from a set of available operations, and applied in sequence. Here we show an example with K = 2 stages of augmentation. We omit details relating to the per-class magnitudes and probabilities of sampling for clarity.

4.2.1. Formalizing TaskAug

Mathematical definition. The policy is defined following Hataya et al. (2020). At each augmentation stage k ∈ {1, ..., K} we have a set of operation selection parameters π^(k) ∈ [0, 1]^M, where Σ_i π_i^(k) = 1 for all k. Each vector π^(k) parameterizes a categorical distribution such that each entry π_i^(k) represents the probability of selecting operation i at augmentation stage k. We obtain a reparameterizable sample from this categorical distribution (using the Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016)) at each stage to select the operation to use, as follows:

    u ∼ Categorical(π^(k))                       # note that u ∈ R^M    (1)
    i = argmax(u)                                                       (2)
    x̃ = (u_i / stop_grad(u_i)) · A_i(x, y; µ_0, µ_1).                   (3)

The multiplicative factor u_i / stop_grad(u_i) allows differentiation w.r.t. the operation selection parameters π. This enables gradient-based optimization of π (see Section 4.2.2). The denominator is necessary because the reparameterized sample from the categorical distribution is not one-hot. Further details are in Appendix A.

Suppose a particular augmentation function A_i with strength parameters µ_0 and µ_1 is obtained following Eqns 1 and 2. Then, denoting the input to this augmentation stage as x with label y, the function A_i that computes the augmented output is defined as:

    A_i(x, y; µ_0, µ_1) = t_i(x; s),                                    (4)

where t_i is the actual transformation applied to the signal (e.g., time masking), and s is the transformation strength, computed as: s = y µ_1 + (1 − y) µ_0. See Appendix A for a detailed example of the different steps in applying TaskAug.

Extension to multiclass and multilabel settings. Our instantiation of TaskAug is for the binary classification setting, since this is the scenario we consider in our experiments. The formulation can be extended to multiclass/multilabel problems by defining an operation selection probability matrix and strength matrix at each augmentation stage. The operation selection probabilities and operation strengths for a given example are then obtained by taking the matrix product of the relevant parameter matrix and the label vector y.
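A minimal PyTorch sketch of one TaskAug stage (Eqns 1–4) is below. The operation list, the parameter shapes, and the Gumbel-Softmax temperature tau are assumptions of this sketch:

    import torch
    import torch.nn.functional as F

    def apply_stage(x, y, ops, logits, mu0, mu1, tau=1.0):
        """One TaskAug stage: sample an operation (Eqns 1-2), apply it (Eqns 3-4).

        ops:     list of M functions t_i(x, s) -> augmented x
        logits:  [M] unnormalized selection parameters (softmax gives pi)
        mu0/mu1: [M] per-class strength parameters
        y:       scalar 0/1 label for this example
        """
        u = F.gumbel_softmax(logits, tau=tau, hard=False)  # reparameterized, not one-hot
        i = int(torch.argmax(u))
        s = y * mu1[i] + (1 - y) * mu0[i]                  # class-dependent strength
        scale = u[i] / u[i].detach()                       # u_i / stop_grad(u_i): value 1,
        return scale * ops[i](x, s)                        # but lets grads reach logits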
4.2.2. Optimizing Policy Parameters

Although the defined policy is flexible, it introduces many new parameters: for a binary problem, there are M operation selection parameters for the categorical distributions at each stage, and 2 strength parameters at each stage, resulting in K × (2 + M) total parameters. Finding effective values for these parameters with random/grid search or Bayesian optimization is computationally expensive since they require training models many times with different parameter settings. We therefore use a gradient-based learning scheme to learn these parameters online.

We optimize policy parameters to minimize a model's validation loss, which is computed using non-augmented data. Following prior work (Lorraine et al., 2020; Hataya et al., 2020; Raghu et al., 2021c), we alternate gradient updates on the network parameters θ and the augmentation parameters φ by iterating the following steps (details and full algorithm in Appendix A):

• Optimize the model parameters θ for P steps: at each step, sample a batch (x, y) of data from D^(train), augment the batch with the augmentation policy to obtain (x̃, y), compute the predicted label ŷ, and update the model parameters using gradient descent: θ ← θ − η∇L(y, ŷ).
• Compute the validation loss L_V using an un-augmented batch from the validation dataset.
• Perform a gradient update on the augmentation parameters φ. We use the chain rule to re-express the gradient w.r.t. the augmentation parameters:

    ∂L_V/∂φ = ∂L_V/∂θ × ∂θ/∂φ,

and compute this as follows. The first term on the RHS is found exactly using straightforward backpropagation; the second term is approximated using the algorithm from Lorraine et al. (2020), leveraging implicit differentiation for efficient computation (since differentiating through training exactly is too memory-intensive). The augmentation parameters are then updated: φ ← φ − η ∂L_V/∂φ.

By using this algorithm, augmentation parameters are learned on a per-task basis, and analyzing the learned parameters may allow us to understand what augmentations are useful for different problems. We return to this in Section 5.2.2. A sketch of the alternating update loop is given after this paragraph.

Computational cost. Optimizing policy parameters in this manner is significantly more computationally efficient than running a grid search over parameter values. With P = 1, running this algorithm has about 2–3× the computational cost of training without any augmentations.
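The alternating scheme can be sketched as follows. This is illustrative only: `hypergradient` is a hypothetical helper standing in for the implicit-differentiation approximation of ∂L_V/∂φ, which is sketched in Appendix A.2.2.

    import torch.nn.functional as F

    def train_with_taskaug(model, policy, train_loader, val_loader,
                           opt_theta, opt_phi, P, hypergradient):
        """Alternate P model (theta) updates with one policy (phi) update."""
        for step, (x, y) in enumerate(train_loader):
            loss = F.binary_cross_entropy_with_logits(
                model(policy(x, y)).squeeze(-1), y.float())   # training loss L_T
            opt_theta.zero_grad(); loss.backward(); opt_theta.step()
            if (step + 1) % P == 0:                           # one outer step per P inner steps
                xv, yv = next(iter(val_loader))               # un-augmented validation batch
                val_loss = F.binary_cross_entropy_with_logits(
                    model(xv).squeeze(-1), yv.float())        # validation loss L_V
                grads = hypergradient(val_loss, model, policy)
                for p, g in zip(policy.parameters(), grads):
                    p.grad = g                                # set approx. dL_V/dphi
                opt_phi.step(); opt_phi.zero_grad()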
5. Experiments

We evaluate the data augmentation strategies on ECG prediction tasks. We have two main experimental questions: (1) in what settings can data augmentation be beneficial, and (2) when data augmentation does help, which augmentation strategies are most effective? To investigate these questions, we consider a range of settings that cover three different 12-lead ECG datasets and eight prediction tasks of varying difficulty, class imbalance, and training set sizes.

5.1. Experimental Setup

5.1.1. Datasets and Tasks

We highlight key information about our datasets and tasks here, with a summary in Table 1.

Dataset A is from Massachusetts General Hospital (MGH) and contains paired 12-lead ECGs and labels for different cardiac abnormalities. Of the available labels in the dataset, we select Right Ventricular Hypertrophy (RVH) and Atrial Fibrillation (AFib) as two of the predictive tasks in our evaluation. These were chosen because (1) they have been previously studied as prediction targets from the ECG (Couceiro et al., 2008; Lin and Lu, 2020), and (2) they have low positive prevalence: 1% for RVH and 5% for AFib, and therefore help to understand the impact of data augmentation in imbalanced prediction problems.

Dataset B is PTB-XL (Wagner et al., 2020; Goldberger et al., 2000), an open-source dataset of 12-lead ECGs. Each ECG has labels for four different categories of cardiac abnormality. This dataset has been used in prior work to evaluate ECG predictive models (Gopal et al., 2021; Kiyasseh et al., 2021).

Dataset C is from the same hospital (MGH) as Dataset A and contains paired ECGs and labels for two hemodynamics parameters, Cardiac Output (CO) and Pulmonary Capillary Wedge Pressure (PCWP). These measures of cardiac health are important in deciding treatment strategies for patients with cardiac disease (Yancy et al., 2013; Hurst et al., 1990; Solin et al., 1999). Typically, these parameters can only be measured accurately through an invasive cardiac catheterization procedure (Bajorat et al., 2006; Hiemstra et al., 2019). As a result, datasets with paired ECGs and hemodynamics measurements are relatively small. Considering the use of data augmentations to improve model performance in this limited data regime is therefore clinically relevant. We specifically consider inferring abnormally low Cardiac Output, and abnormally high Pulmonary Capillary Wedge Pressure.

Note that the tasks considered cover different classes of cardiac abnormalities: ischemia (MI, STTC), structural (HYP, RVH), electrical (CD, AFib), and abnormal hemodynamics (low CO, high PCWP).

Table 1: Summary information about the datasets and tasks considered in our empirical evaluation.

    Dataset            | Task name                                        | Prevalence | Abnormality type | #ECGs/#patients
    Dataset A          | Right Ventricular Hypertrophy (RVH)              | 1%         | Structural       | 705057/705057
    Dataset A          | Atrial Fibrillation (AFib)                       | 5%         | Electrical       | 705057/705057
    Dataset B (PTB-XL) | Hypertrophy (HYP)                                | 12%        | Structural       | 21837/18885
    Dataset B (PTB-XL) | ST/T Change (STTC)                               | 22%        | Ischemia         | 21837/18885
    Dataset B (PTB-XL) | Conduction Disturbance (CD)                      | 24%        | Electrical       | 21837/18885
    Dataset B (PTB-XL) | Myocardial Infarction (MI)                       | 25%        | Ischemia         | 21837/18885
    Dataset C          | Low Cardiac Output (CO)                          | 4%         | Hemodynamics     | 6290/4051
    Dataset C          | High Pulmonary Capillary Wedge Pressure (PCWP)   | 26%        | Hemodynamics     | 6290/4051

Dataset splitting. Since the value of data augmentation can depend on the amount of training data, we train on different dataset sizes. For the non-hemodynamic tasks (Datasets A and B), we generate development datasets with 1000, 2500, and 5000 ECGs. On the more challenging hemodynamics inference tasks (Dataset C), for elevated PCWP, we consider two settings: using a development set of size 1000, and using the full dataset. For low CO, we only use the full dataset, since reducing the dataset size led to poor quality models.

In each setting, we split datasets into development and testing sets on a patient level (no patient is in both sets). We split the development set into an 80-20 training-validation split.
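Patient-level splitting can be implemented with grouped splitters. Below is a minimal sketch using scikit-learn's GroupShuffleSplit; the paper does not state the exact tooling used, so this is an illustration rather than the authors' code. Note that for GroupShuffleSplit the split sizes are fractions of patients (groups), not of ECGs:

    from sklearn.model_selection import GroupShuffleSplit

    def patient_level_split(X, y, patient_ids, dev_frac=0.8, seed=0):
        """Development/test split with no patient in both sets, then an
        80-20 train/validation split of the development set."""
        outer = GroupShuffleSplit(n_splits=1, train_size=dev_frac, random_state=seed)
        dev, test = next(outer.split(X, y, groups=patient_ids))
        inner = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=seed)
        tr, va = next(inner.split(X[dev], y[dev], groups=patient_ids[dev]))
        return dev[tr], dev[va], test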
5.1.2. TaskAug Transformations

Based on prior work in time series and ECG data augmentation (Iwana and Uchida, 2021a; Mehari and Strodthoff, 2021), we use the following transformations in the TaskAug policy. Mathematical descriptions are in Appendix A.

• Random temporal warp: The signal is warped with a random, diffeomorphic temporal transformation. This is formed by sampling from a zero mean, fixed variance Gaussian at each temporal location in the signal to obtain a velocity field, and then integrating and smoothing (following Balakrishnan et al. (2018, 2019)) to generate a temporal displacement field, which is applied to the signal. The variance is the strength parameter, with higher variance indicating more warping.
• Baseline wander: A low-frequency sinusoidal component is added to the signal, with the amplitude of the sinusoid representing the strength.
• Gaussian noise: IID Gaussian noise is added to the signal, with the strength parameter representing the variance of the Gaussian.
• Magnitude scale: The signal amplitude is scaled by a number drawn from a scaled uniform distribution, with the scale being the strength parameter.
• Time mask: A random contiguous section of the signal is masked out (set to zero).
• Random temporal displacement: The entire signal is translated forwards or backwards in time by a random temporal offset, drawn from a uniform distribution scaled by a strength parameter.

Note that our instantiation of the augmentation policy could utilize many more operations, but we keep it to this number for simplicity and to assist in interpreting the learned policies.
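As an illustration of how lightweight these transformations are, here is a NumPy sketch of two of them, loosely following the parameterizations given in Appendix A.2.3. The sampling-rate handling and the 2π factor in the sinusoid are assumptions of this sketch:

    import numpy as np

    def gaussian_noise(x, s, rng):
        """Add IID Gaussian noise scaled by the per-lead standard deviation."""
        sigma = x.std(axis=-1, keepdims=True)    # std of each lead of a [12, T] signal
        strength = 1.0 / (1.0 + np.exp(-s))      # sigmoid of the strength parameter s
        return x + 0.25 * sigma * strength * rng.standard_normal(x.shape)

    def baseline_wander(x, s, rng, fs=250):
        """Add a low-frequency sinusoid at roughly respiratory frequency."""
        t = np.arange(x.shape[-1]) / fs          # time in seconds (assumption)
        amp = 0.25 * (1.0 / (1.0 + np.exp(-s))) * rng.uniform()
        freq = (20 * rng.uniform() + 10) / 60    # breaths per second
        phase = 2 * np.pi * rng.uniform()
        return x + amp * np.sin(2 * np.pi * freq * t + phase)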
5.1.3. Implementation Details

Network architecture. We standardize the network architecture to be a 1D convolutional network, based on the ResNet-18 architecture, since prior work has shown architectures of this form to be effective with ECG data (Diamant et al., 2021). Full architectural details are in the appendix.

Training Details. On Datasets A and B, all models are trained for 100 epochs, using early stopping based on validation loss. For the hemodynamics inference problems on Dataset C, we train models for 50 epochs with early stopping (since we observed significant overfitting after this point). We consider 15 random development/testing set splits for Datasets A and C (lower prevalences for some tasks meant that performance was more variable with fewer runs), and 5 splits for Dataset B. We train models using the Adam optimizer and a learning rate of 1e-3. This value resulted in stable and effective training across all models (as compared to 1e-4, 5e-4, and 5e-3). As evaluation, we compute the AUROC of the best performing model on the held-out testing set, and report mean/standard error across runs. We also report results for a baseline (NoAugs) that does not use any data augmentation.

Augmentation Hyperparameters. In TaskAug, we set the number of augmentation stages to K = 2 (defined in Section 4.2.1), following prior work (Hataya et al., 2020). For the number of model optimization steps P (defined in Section 4.2.2), we evaluate both P = 1 and P = 5, and select the best performing setting based on validation set loss. Further discussion on the choice of P is in Appendix A. For Time Masking and SpecAugment, we search over the masking window, considering w ∈ {0.1, 0.2} for SpecAugment (range based on Kiyasseh et al. (2021)) and w ∈ {0.1, 0.2, 0.5} for Time Masking (range based on Gopal et al. (2021)).

5.2. Results

5.2.1. Quantitative Results

Non-hemodynamics tasks. We first analyze performance of augmentation strategies on the non-hemodynamics tasks. Given that performance improvements are most evident in the lowest sample regimes for both datasets (N = 1000), we focus on this setting, with results shown in Table 2. Results for the higher sample regimes are in the appendix. We summarize key findings here.

Table 2: Augmentation strategies improve AUROC on detecting most cardiac abnormalities in the low-sample regime (N = 1000), and TaskAug is among the best-performing methods. Table shows mean and standard error of AUROC (best-performing method bolded, second best underlined, statistically significant (p < 0.05) improvement over NoAugs marked *). The impact of augmentations is task-dependent, with some tasks (such as RVH, MI) showing improved performance on average with almost all strategies, and others (HYP) showing no improvement with any strategy. TaskAug is among the best methods across tasks, and improves performance on tasks such as AFib where no other augmentations help.

    Method   | A: RVH     | A: AFib    | B: HYP     | B: STTC    | B: MI       | B: CD
    NoAugs   | 72.6 ± 2.7 | 79.8 ± 1.4 | 84.3 ± 1.4 | 87.6 ± 0.8 | 80.0 ± 0.8  | 82.2 ± 0.6
    TaskAug  | 78.4 ± 1.9 | 82.8 ± 1.0 | 83.7 ± 0.5 | 87.8 ± 0.4 | 82.3 ± 0.5* | 83.1 ± 0.4
    SMOTE    | 75.9 ± 1.8 | 79.0 ± 1.4 | 80.4 ± 0.6 | 87.0 ± 0.5 | 81.2 ± 0.6  | 82.6 ± 0.8
    DGW      | 73.6 ± 1.7 | 77.4 ± 1.5 | 83.9 ± 0.7 | 87.5 ± 0.5 | 81.1 ± 0.6  | 81.8 ± 1.0
    SpecAug  | 77.9 ± 1.7 | 77.2 ± 2.1 | 83.5 ± 0.8 | 87.7 ± 0.4 | 81.1 ± 0.7  | 82.2 ± 0.7
    TimeMask | 72.8 ± 2.1 | 77.9 ± 1.9 | 82.9 ± 0.7 | 87.7 ± 0.7 | 81.1 ± 1.3  | 83.8 ± 1.1

The value of augmentation varies by task. For some tasks such as RVH and MI, almost all augmentation strategies lead to performance improvements. On other tasks such as STTC and HYP, performance is the same or worse when applying augmentations. The improvement seen with RVH could be due to the fact that it is particularly low prevalence (1%), so all augmentation strategies have an oversampling effect and thus boost performance.

TaskAug performs well on average. TaskAug almost always improves on the NoAugs baseline, and even boosts performance on some tasks where other augmentations worsen performance (AFib). Although TaskAug does not always result in a statistically significant (p < 0.05) improvement in AUROC, it is the only method to significantly improve AUPRC over NoAugs on the low-prevalence tasks, RVH and AFib (see Appendix C, Table 4). When TaskAug results in lower performance than other augmentation strategies (e.g., TimeMasking for CD), it is still competitive with these methods and never causes a statistically significant reduction in performance compared to other methods. This suggests that for a new task, it may always be worth using TaskAug to see if performance is boosted. We hypothesize that TaskAug's efficacy is due to its flexible and learned nature, examined in ablation studies (Section 5.2.3).

Performance improvements are smaller on Dataset B. The maximum improvement over the NoAugs baseline in Dataset A (5.8%) is greater than the maximum improvement in Dataset B (2.3%). We hypothesize two reasons for this. Firstly, the prevalence in Dataset B is higher, meaning that augmentations may not have as much of an effect at N = 1000. We study this in Appendix C, Table 11, where we examine performance at the N = 500 data regime for Dataset B, and find that the maximum improvement (obtained with TaskAug for MI) goes up to 4%. Secondly, Dataset A has narrower label definitions than Dataset B, and this affects performance, especially with TaskAug. The HYP, STTC, and CD classes of abnormalities in Dataset B aggregate many sub-categories together (see Appendix B), and these sub-categories may each benefit from different augmentations. In contrast, the labels in Dataset A are fine-grained, and so TaskAug, which optimizes augmentations on a per-task basis, learns more appropriate augmentation strategies. This hypothesis is supported by the fact that with MI (a more fine-grained label than HYP, CD, and STTC) we observe improvements over the NoAugs baseline (clearly seen in the N = 500 regime, Appendix C, Table 11).

Performance improvements at higher samples are lower, as seen in the results in Appendix C. Augmentations do not worsen performance however, and some tasks (STTC, CD) benefit a small amount, ∼ +1% AUROC.
Hemodynamics tasks. Table 3 presents results for performance on the more challenging hemodynamics prediction tasks. All methods are comparable with or improve on the no augmentation baseline for low CO prediction, likely because of the low prevalence of the positive label (4%). For inferring high PCWP, at both low sample and higher samples, TaskAug obtains improvements in performance (though not significant at the p < 0.05 level); however, other methods do not consistently improve on the no augmentation baseline. Although improvements in AUROC are not statistically significant, we observe significant improvements with TaskAug in AUPRC for low CO detection (see Appendix C, Table 6). Again, we see that the benefit of augmentation varies with the task, prevalence, and dataset size, and that TaskAug is better than or competitive with other strategies.

Table 3: Training with data augmentation improves AUROC on two hemodynamics inference tasks, and TaskAug again is among the best-performing methods. Table shows mean and standard error of AUROC (best-performing method bolded, second best underlined). All methods are comparable with or improve on the no augmentation baseline for Low CO prediction, possibly because of the low prevalence of the label (4%). The performance of methods on the High PCWP task is more variable across the two sample sizes. TaskAug obtains improvements in all three settings considered.

    Method   | Low CO     | High PCWP: N = 1000 | High PCWP: All Data
    NoAugs   | 65.9 ± 1.2 | 66.7 ± 0.7          | 74.4 ± 0.5
    TaskAug  | 68.2 ± 1.0 | 67.9 ± 0.7          | 75.1 ± 0.4
    SMOTE    | 66.0 ± 1.4 | 67.2 ± 0.5          | 73.6 ± 0.5
    DGW      | 68.3 ± 0.9 | 66.4 ± 0.6          | 74.9 ± 0.4
    SpecAug  | 66.1 ± 0.9 | 66.4 ± 1.3          | 75.0 ± 0.4
    TimeMask | 66.8 ± 1.1 | 67.3 ± 0.4          | 74.6 ± 0.4

5.2.2. Analyzing Learned Policies

We analyze the learned policies for three of the predictive tasks: AFib, PCWP, and RVH (appendix).

AFib, Figure 3. We see that time mask has a high probability of selection (Figure 3(a)). Since AFib is characterized in the ECG by an irregular R peak-R peak interval (Couceiro et al., 2008), which is often present regardless of which section of ECG is selected, time masking is likely label preserving, and is a sensible choice. Considering the learned time warp strength in Figure 3(b), we observe that signals labelled negative for AFib are warped less strongly than those with AFib, again sensible since time warping may affect the label of a signal and introduce AFib in a signal where it was not originally present.

Figure 3: The TaskAug policy for Atrial Fibrillation detection. We focus on the probability of selecting each transformation in both augmentation stages (left) and the optimized temporal warp strengths in the first stage (right). We show the mean/standard error of these optimized policy parameters over 15 runs. Given the characteristic features of AFib (e.g., irregular R-R interval), Time Masking is likely to be label preserving and therefore it is sensible that it has a high probability of selection. The temporal warp strength for positive samples is higher than that for negative samples, which makes sense since time warping a negative sample too strongly could change its label. [Panels: (a) operation selection probabilities over warp, wander, noise, mask, disp, scale, and no-op for stages 1 and 2; (b) stage-1 temporal warp strength for class 0 (no AFib) and class 1 (AFib).]

PCWP, Figure 4. We have limited domain understanding of what augmentations may be label preserving and help model performance, since detecting high PCWP from ECGs is not something clinicians are typically able to do (Schlesinger et al., 2021). Analyzing the augmentations could provide hypotheses about what features in the data encode the class label. Noise, displacement, and baseline wander all obtain higher weight in the first stage, and scaling obtains higher weight in the second stage. The high weight assigned to noise could be to help the model build invariance to it, and not use it as a predictive aspect of the signal. Studying the magnitude scaling in Figure 4(b), we see positive examples are scaled significantly more than negative examples. It is possible that negative examples are more sensitive to scale, and scaling them pushes them into positive example space. The positive examples may have more variance in scaling, and thus scaling them further has less of an effect.

Figure 4: The TaskAug policy for detecting elevated Pulmonary Capillary Wedge Pressure. We focus on the probability of selecting each transformation in both augmentation stages (left) and the optimized magnitude scaling strengths in the second stage (right). We show the mean/standard error of these optimized policy parameters over 15 runs. There exists little domain knowledge about what features in the ECG may encode elevated PCWP, so examining the learned augmentations here could provide hypotheses of invariances in the data. Of interest is that the positive class is augmented with stronger magnitude scaling than the negative class, suggesting that scaling negative examples could affect their labels. [Panels: (a) operation selection probabilities for stages 1 and 2; (b) stage-2 magnitude scale strength for class 0 (normal PCWP) and class 1 (high PCWP).]
5.2.3. Ablation Studies

How much does optimizing augmentations help? Our results show that TaskAug offers improvements in performance. In Figure 5, we examine how the actual optimization of the augmentation policy parameters (operation selection probabilities and magnitudes, Section 4.2.1) affects performance, considering the AFib and MI detection tasks and N = 1000. We compare the performance of optimizing the policy parameters vs. keeping them fixed at their initialized values and training. We observe improvements in performance through the optimization process, suggesting that it is not only the range of augmentations that leads to improved performance, but also the optimization of the policy parameters. In Appendix C, we study this at different dataset sizes and find that performance is improved by optimization at each size.

Figure 5: Optimizing the TaskAug policy parameters results in performance improvements. We show the mean/standard error of AUROC over 15 runs for AFib and over 5 runs for MI. Without optimizing policy parameters (InitAug), performance is comparable to not using augmentations at all, indicating the importance of learning the policy parameters. [Bar charts of AUROC for AFib and MI: No Augs, Init Augs, Optimized Augs.]

How much do class-specific magnitudes help? TaskAug instantiates magnitude parameters for the augmentation operations on a per-class basis, as described in Section 4.2.1, allowing positive and negative examples to be augmented differently. We examine this further, considering the AFib and MI detection tasks and N = 1000. We compare performance using class-specific magnitude parameters (the positive and negative examples have independent augmentation magnitudes µ_1 and µ_0) vs. using global magnitude parameters (the positive and negative examples are forced to have the same augmentation magnitude: µ = µ_0 = µ_1). Results are shown in Figure 6. We observe noticeable improvements in performance with class-specific magnitude parameters, demonstrating the importance of independently specifying magnitudes for the two classes. In Appendix C, we study this at different dataset sizes and find that performance is improved at each size.

Figure 6: Class-specific magnitude parameters in TaskAug lead to improvements in performance. We show the mean/standard error of AUROC over 15 runs for AFib and over 5 runs for MI. This is particularly true for tasks such as AFib where some operations may not be label preserving. [Bar charts of AUROC for AFib and MI: No Augs, TaskAug Ablation (not class-specific), TaskAug.]

5.2.4. Summary and Best Practices

• Training with data augmentations does not always improve model performance, and may even hurt it. The impact of augmentation depends on the nature of the task, positive class prevalence, and dataset size.
• Augmentations are most often useful in the low-sample regime. Where the prevalence is particularly low (see results for RVH detection), various augmentation strategies improve performance, perhaps by functioning as a form of oversampling.
• Data augmentations do not always improve performance at high sample sizes, but do not hurt it.
• TaskAug, our proposed augmentation strategy, is the most effective method on average, and could therefore be the first augmentation strategy one tries on a new ECG prediction problem. TaskAug defines a flexible augmentation policy that is optimized on a task-dependent basis, which directly contributes to its effectiveness.
• TaskAug also offers insights as to what augmentations are most effective for a given problem, which could be useful in novel prediction tasks (e.g., hemodynamics inference) to suggest what aspects of the ECG determine the class label.

6. Conclusion

In this work, we studied the use of data augmentation for prediction problems from 12-lead electrocardiograms (ECGs). We outlined TaskAug, a new, learnable data-augmentation strategy for ECGs, and conducted an empirical study of this method and several existing augmentation strategies. In our experimental evaluation on three ECG datasets and eight distinct predictive tasks, we find that data augmentation is not always helpful for ECG prediction problems, and for some tasks may worsen performance. Augmentations can be most helpful in the low-sample regime, and specifically when the prevalence of the positive class is low. Our proposed learnable augmentation strategy, TaskAug, was among the strongest performing methods in all tasks. TaskAug augmentation policies are additionally interpretable, providing insight as to what transformations are most important for different problems. Future work could consider applying TaskAug to other settings (e.g., multiview contrastive learning) and modalities (e.g., EEGs) where flexible augmentation policies may be useful and could be interpreted to provide scientific insight.
Institutional Review Board (IRB)

This study was approved by the Institutional Review Board (IRB) at Massachusetts General Hospital (protocol 2020P000132).

References

J Bajorat, R Hofmockel, DA Vagts, M Janda, B Pohl, C Beck, and G Noeldge-Schomburg. Comparison of invasive and less-invasive techniques of cardiac output measurement under different haemodynamic conditions in a pig model. European Journal of Anaesthesiology, 23(1):23–30, 2006.

Guha Balakrishnan, Amy Zhao, Mert Sabuncu, John Guttag, and Adrian V. Dalca. An unsupervised learning model for deformable medical image registration. In CVPR: Computer Vision and Pattern Recognition, pages 9252–9260, 2018.

Guha Balakrishnan, Amy Zhao, Mert Sabuncu, John Guttag, and Adrian V. Dalca. VoxelMorph: A learning framework for deformable medical image registration. IEEE TMI: Transactions on Medical Imaging, 38:1788–1800, 2019.

Kasun Bandara, Hansika Hewamalage, Yuan-Hao Liu, Yanfei Kang, and Christoph Bergmeir. Improving the accuracy of global forecasting models using time series data augmentation. Pattern Recognition, 120:108148, 2021.

Rohan Banerjee and Avik Ghose. Synthesis of realistic ECG waveforms using a composite generative adversarial network for classification of atrial fibrillation. In 2021 29th European Signal Processing Conference (EUSIPCO), pages 1145–1149, 2021. doi: 10.23919/EUSIPCO54536.2021.9616079.

Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370, Seattle, WA, USA, 1994.

Henry Blackburn, Ancel Keys, Ernst Simonson, Pentti Rautaharju, and Sven Punsar. The electrocardiogram in population studies: a classification system. Circulation, 21(6):1160–1175, 1960.
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

Ricardo Couceiro, Paulo Carvalho, Jorge Henriques, Manuel Antunes, Matthew Harris, and Jörg Habetha. Detection of atrial fibrillation using model-based ECG analysis. In 2008 19th International Conference on Pattern Recognition, pages 1–5. IEEE, 2008.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.

Nathaniel Diamant, Erik Reinertsen, Steven Song, Aaron Aguirre, Collin Stultz, and Puneet Batra. Patient contrastive learning: a performant, expressive, and practical approach to ECG modeling. 2021.

Francis M Fesmire, Robert F Percy, Jim B Bardoner, David R Wharton, and Frank B Calhoun. Usefulness of automated serial 12-lead ECG monitoring during the initial emergency department evaluation of patients with chest pain. Annals of Emergency Medicine, 31(1):3–11, 1998.

A. Goldberger, L. A. Amaral, L. Glass, Jeffrey M. Hausdorff, P. Ivanov, R. Mark, J. Mietus, G. Moody, C. Peng, and H. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23):E215–20, 2000.

Bryan Gopal, Ryan W. Han, Gautham Raghupathi, Andrew Y. Ng, Geoffrey H. Tison, and Pranav Rajpurkar. 3KG: Contrastive learning of 12-lead electrocardiograms using physiologically-inspired augmentations. 2021.

Awni Y Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H Tison, Codie Bourn, Mintu P Turakhia, and Andrew Y Ng. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1):65–69, 2019.

Faezeh Nejati Hatamian, Nishant Ravikumar, Sulaiman Vesal, Felix P Kemeth, Matthias Struck, and Andreas Maier. The effect of data augmentation on classification of atrial fibrillation in short single-lead ECG signals using deep neural networks. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1264–1268. IEEE, 2020.

Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, and Hideki Nakayama. Meta approach to data augmentation optimization. arXiv preprint arXiv:2006.07965, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Bart Hiemstra, Geert Koster, Renske Wiersema, Yoran M Hummel, Pim van der Harst, Harold Snieder, Ruben J Eck, Thomas Kaufmann, Thomas WL Scheeren, Anders Perner, et al. The diagnostic accuracy of clinical examination for estimating cardiac index in critically ill patients: the Simple Intensive Care Studies-I. Intensive Care Medicine, 45(2):190–200, 2019.

J Hurst, C Rackley, E Sonnenblick, and N Wenger. The Heart, Arteries and Veins, volume 1. McGraw-Hill, 1990.

Brian Kenji Iwana and Seiichi Uchida. An empirical survey of data augmentation for time series classification with neural networks. PLOS ONE, 16(7):e0254841, 2021a.

Brian Kenji Iwana and Seiichi Uchida. Time series data augmentation for neural networks by time warping with a discriminative teacher. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 3558–3565. IEEE, 2021b.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

Dani Kiyasseh, Tingting Zhu, and David A Clifton. CLOCS: Contrastive learning of cardiac signals across space, time, and patients. In International Conference on Machine Learning, pages 5606–5615. PMLR, 2021.

Gen-Min Lin and Henry Horng-Shing Lu. A 12-lead ECG-based system with physiological parameters and machine learning to identify right ventricular hypertrophy in young adults. IEEE Journal of Translational Engineering in Health and Medicine, 8:1–10, 2020. doi: 10.1109/JTEHM.2020.2996370.

Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540–1552. PMLR, 2020.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Temesgen Mehari and Nils Strodthoff. Self-supervised representation learning from 12-lead ECG data. arXiv preprint arXiv:2103.12676, 2021.

Meinard Müller. Dynamic time warping. Information Retrieval for Music and Motion, pages 69–84, 2007.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

Daniel S Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V Le, and Yonghui Wu. SpecAugment on large scale datasets. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6879–6883. IEEE, 2020.
Aniruddh Raghu, John Guttag, Katherine Young, Eugene Pomerantsev, Adrian V Dalca, and Collin M Stultz. Learning to predict with supporting evidence: applications to clinical risk prediction. In Proceedings of the Conference on Health, Inference, and Learning, pages 95–104, 2021a.

Aniruddh Raghu, Jonathan Lorraine, Simon Kornblith, Matthew McDermott, and David K Duvenaud. Meta-learning to improve pre-training. Advances in Neural Information Processing Systems, 34, 2021b.

Aniruddh Raghu, Maithra Raghu, Simon Kornblith, David Duvenaud, and Geoffrey Hinton. Teaching with commentaries. In International Conference on Learning Representations, 2021c.

Sushravya Raghunath, Alvaro E Ulloa Cerna, Linyuan Jing, Joshua Stough, Dustin N Hartzel, Joseph B Leader, H Lester Kirchner, Martin C Stumpe, Ashraf Hafez, Arun Nemani, et al. Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network. Nature Medicine, 26(6):886–891, 2020.

Stephen M Salerno, Patrick C Alguire, and Herbert S Waxman. Competency in interpretation of 12-lead electrocardiograms: a summary and appraisal of published evidence. Annals of Internal Medicine, 138(9):751–760, 2003.

Daphne Schlesinger, Nathaniel Diamant, Aniruddh Raghu, Erik Reinertsen, Katherine Young, Puneet Batra, Eugene Pomerantsev, and Collin M. Stultz. A deep learning model for inferring elevated pulmonary capillary wedge pressures from the 12-lead electrocardiogram. 2021.

Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019.

Slawek Smyl and Karthik Kuber. Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural networks. In 36th International Symposium on Forecasting, 2016.

Peter Solin, Peter Bergin, Meroula Richardson, David M Kaye, E Haydn Walters, and Matthew T Naughton. Influence of pulmonary capillary wedge pressure on central apnea in heart failure. Circulation, 99(12):1574–1579, 1999.

Terry T Um, Franz MJ Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić. Data augmentation of wearable sensor data for Parkinson's disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 216–220, 2017.

Patrick Wagner, Nils Strodthoff, Ralf-Dieter Bousseljot, Dieter Kreiseler, Fatima I Lunze, Wojciech Samek, and Tobias Schaeffter. PTB-XL, a large publicly available electrocardiography dataset. Scientific Data, 7(1):1–15, 2020.

Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, and Huan Xu. Time series data augmentation for deep learning: A survey. arXiv preprint arXiv:2002.12478, 2020.

Clyde W Yancy, Mariell Jessup, Biykem Bozkurt, Javed Butler, Donald E Casey, Mark H Drazner, Gregg C Fonarow, Stephen A Geraci, Tamara Horwich, James L Januzzi, et al. 2013 ACCF/AHA guideline for the management of heart failure: executive summary: a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Journal of the American College of Cardiology, 62(16):1495–1539, 2013.
Appendix A. Augmentation Methods

In this section, we provide further details on the different augmentation strategies explored (existing and TaskAug), and visualize their operation.

A.1. Existing Methods

Figures 7–10 present examples following augmentation using the existing methods. We show only one lead for clarity; however, these operations will be applied to each lead.

Figure 7: Time Masking. [Original signal and the signal after time masking.]
Figure 8: SpecAugment. [Original signal and the signal after SpecAugment.]
Figure 9: Discriminative Guided Warping (DGW). [Original signal and the signal after DGW.]
Figure 10: SMOTE. [An original signal example and a signal generated with SMOTE.]

A.2. TaskAug

We provide more details about TaskAug: (1) further information about the mathematical formalism of the policy and an example of applying the different steps; (2) a more detailed description of the nested optimization algorithm used to learn TaskAug parameters, including a full algorithm; and (3) mathematical descriptions of the operations used in TaskAug in our experiments and a visualization of their effect on an ECG signal.

A.2.1. Structure of Policy

Mathematical definition. As described in Section 4.2.1, the TaskAug policy is defined following Hataya et al. (2020). At each augmentation stage k ∈ {1, ..., K} we have a set of operation selection parameters π^(k) ∈ [0, 1]^M, where Σ_i π_i^(k) = 1 for all k. Each vector π^(k) parameterizes a categorical distribution such that each entry π_i^(k) represents the probability of selecting operation i at augmentation stage k. We obtain a reparameterizable sample from this categorical distribution (using the Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016)) at each stage to select the operation to use, as follows:

    u ∼ Categorical(π^(k))                       # note that u ∈ R^M    (5)
    i = argmax(u)                                                       (6)
    x̃ = (u_i / stop_grad(u_i)) · A_i(x, y; µ_0, µ_1).                   (7)

Why the multiplicative factor? We use the multiplicative factor u_i / stop_grad(u_i) to allow gradient flow to the operation selection parameters π. If we just selected i = argmax(u) and had no scaling in Eqn 7, then there would be no gradient flow to π, since the argmax operation is not differentiable. The denominator of this scaling factor is necessary because u_i, obtained from the reparameterized sample from the categorical distribution, is not one-hot. The resulting fraction used as the scaling factor always has magnitude 1, since |stop_grad(u_i)| = |u_i|. When we take the gradient, we get:

    ∂/∂π [u_i / stop_grad(u_i)] = (1 / stop_grad(u_i)) × ∂u_i/∂π,

so the stop_grad(u_i) acts as a scaling term.
Example application of TaskAug. Suppose we have a one-stage TaskAug policy, K = 1; our augmentation set has two operations S = {A_1, A_2}, which are A_1 = TimeMask(x, y; µ_0 = 0.2, µ_1 = 0.1) and A_2 = Noise(x, y; µ_0 = 2.1, µ_1 = 5.3); and the operation selection probability vector is π = [0.9, 0.1] (that is, we select TimeMask with probability 0.9, and noise with probability 0.1). Now consider applying TaskAug to a (data, label) pair (x, 1), i.e., the label is 1. We follow these steps:

1. Obtain a reparameterizable sample u from Categorical([0.9, 0.1]): let this be u = [0.75, 0.25].
2. Find i = argmax(u); in this case, i = 1.
3. Select the operation A_1, i.e., TimeMask.
4. Compute the masking strength based on the label. Recall this is defined as s = y µ_1 + (1 − y) µ_0, so s = 1 × 0.1 + (1 − 1) × 0.2 = 0.1.
5. Apply time masking with strength 0.1 to x, generating x̂.
6. Scale this by u_1 / stop_grad(u_1) to generate x̃.

A.2.2. Parameter Optimization

As detailed in the main text, there are many learnable parameters in TaskAug, and we use gradient-based optimization to learn these jointly with the base model parameters. Here, we provide some more details about the estimation of the gradient w.r.t. the TaskAug parameters, and also include a full algorithm detailing the training procedure, Algorithm 1.

Estimating TaskAug parameter gradients. Let the base model parameters after P update steps be denoted θ̂(φ). We update the TaskAug policy parameters to minimize the base model's validation loss L_V, with the gradient of interest being:

    ∂L_V/∂φ = ∂L_V/∂θ̂ × ∂θ̂/∂φ.

The first term on the RHS can be found exactly using standard backpropagation. To compute the second term, we re-express it using the implicit function theorem (IFT) as in Lorraine et al. (2020). Using L_T to denote the training loss, the IFT allows us to re-express this second term as:

    ∂θ̂/∂φ = − [∂²L_T / ∂θ ∂θᵀ]⁻¹ × [∂²L_T / ∂θ ∂φᵀ] evaluated at θ̂(φ),    (8)

which is a product of an inverse Hessian and a matrix of mixed partial derivatives. Adopting the algorithm from Lorraine et al. (2020), we approximate this with a truncated Neumann series with 1 term, and implicit vector-Jacobian products.

Training algorithm. Incorporating this gradient estimator, the algorithm to jointly optimize base model parameters and TaskAug policy parameters is given in Algorithm 1, mirroring the approach used in Raghu et al. (2021c).

Algorithm 1 Optimizing TaskAug parameters.
 1: Initialize base model parameters θ and TaskAug parameters φ
 2: for t = 1, ..., T do
 3:   Compute training loss, L_T(θ)
 4:   Compute ∂L_T/∂θ
 5:   Update θ ← θ − η_θ ∂L_T/∂θ
 6:   if t % P == 0 then
 7:     Set θ̂ = θ
 8:     Compute the validation loss, L_V(θ̂)
 9:     Compute ∂L_V/∂θ̂
10:     Approximate ∂θ̂/∂φ using Equation 8
11:     Compute the derivative ∂L_V/∂φ = ∂L_V/∂θ̂ × ∂θ̂/∂φ using the previous two steps
12:     Update φ ← φ − η_φ ∂L_V/∂φ
13:   end if
14: end for

Choice of P. The value of P influences how many 'inner' gradient steps (to the base model) we perform before an 'outer' gradient step (to the TaskAug parameters). There is a tradeoff here: if P is too small, then applying the IFT to approximate ∂θ̂/∂φ will result in a poor approximation (Lorraine et al., 2020); if P is too large, then updates to the policy parameters will have little effect on model parameters since the base model has already reached minimal training loss (and may start to overfit). In our experiments, we find that P > 5 suffered from this second problem, and P = 1 was sometimes unstable due to the first problem. In general, P = 1 worked well at small sample sizes (N = 1000), and P = 5 worked better at N = 2500 and N = 5000.
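A compact PyTorch sketch of this gradient estimator is below. It is our paraphrase of the Lorraine et al. (2020) scheme, not the released code; in particular, the learning-rate scale factor of the Neumann approximation is assumed to be absorbed into the policy learning rate.

    import torch

    def taskaug_hypergradient(val_loss, train_loss, theta, phi, lr=0.1, terms=1):
        """Approximate dL_V/dphi via Eq. (8), using only vector-Jacobian products.

        theta / phi are lists of model / policy parameter tensors;
        train_loss must depend on both (via the augmented batch).
        """
        v = list(torch.autograd.grad(val_loss, theta, retain_graph=True))  # dL_V/dtheta
        grads = torch.autograd.grad(train_loss, theta, create_graph=True)  # dL_T/dtheta
        p = [vi.clone() for vi in v]
        for _ in range(terms):  # truncated Neumann series: p approximates H^{-1} v
            hv = torch.autograd.grad(grads, theta, grad_outputs=v, retain_graph=True)
            v = [vi - lr * hvi for vi, hvi in zip(v, hv)]
            p = [pi + vi for pi, vi in zip(p, v)]
        # mixed partials of Eq. (8): -(d/dphi)(dL_T/dtheta . p)
        return [-gi for gi in torch.autograd.grad(grads, phi, grad_outputs=p)]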
A.2.3. Augmentation operations

Figure 11 shows the different operations used in TaskAug. We show only one lead for clarity; however, these operations are applied to every lead. We now provide more details on the implementation of these operations in our experiments.

• TimeMask. As with the existing TimeMask strategies, we randomly select a contiguous portion of the signal and set it to zero. We set 10% of the signal to zero in our implementation; this parameter is not optimized.

• Gaussian Noise. IID Gaussian noise is added to the signal, formed as follows. We first compute the standard deviation of each lead of the signal, denoted σ. The noise added to each sample of the signal is then ε = 0.25 × σ × sigmoid(s) × N(0, 1), where s is the learnable strength parameter, initialized to 0. The coefficient 0.25 was chosen by visually inspecting augmented examples and observing that it allowed flexible augmentations to be generated without overwhelming the signal with noise.

• Temporal warping. The signal is warped with a random, diffeomorphic temporal transformation. To form this, we sample at each temporal location from a Gaussian with zero mean and fixed variance 100 × s², where s is the learnable strength parameter (initialized to 1), generating a length-T random velocity field. This velocity field is integrated (following the scaling-and-squaring numerical integration routine used by Balakrishnan et al. (2018, 2019)), and the resulting displacement field is smoothed with a Gaussian filter to produce the smoothed temporal displacement field. This field represents the number of samples each point in the original signal is translated in time. The field is then used to transform the signal, translating each channel in the same way (i.e., the field is shared across channels).

• Baseline wander. We first form a wander amplitude A = 0.25 × sigmoid(s) × Uniform(0, 1), where again s is a learnable strength parameter. We then compute the frequency and phase of the sinusoidal offset. The frequency is f = (20 × Uniform(0, 1) + 10)/60, based on the approximate number of breaths per minute for an adult; the phase is φ = 2π × Uniform(0, 1). The sinusoidal offset is then A sin(ft + φ).

• Magnitude scaling. We scale the entire signal by a random magnitude given by sigmoid(s) × Uniform(0.75, 1.25), where s is a learnable strength parameter, initialized to 0.

• Temporal displacement. We shift the entire signal in time, padding with zeros where required. Our implementation directly generates a displacement field (as with temporal warping) and uses the spatial transformation from Balakrishnan et al. (2018, 2019) to transform the signal. This makes the operation differentiable and allows us to learn the displacement strength s. The displacement magnitude is drawn from a Uniform distribution on [−100 × s², 100 × s²], with the strength initialized to 0.5.

[Figure 11: Examples of the different operations used in TaskAug: the original signal and the result of the time mask, Gaussian noise, temporal warp, baseline wander, magnitude scale, and temporal displacement operations (one lead shown).]
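For concreteness, the following is a minimal PyTorch sketch of four of the simpler operations, following the formulas above. It is not the authors' implementation: the function names, the (leads, T) tensor layout, and treating the strength s as a scalar tensor are assumptions, and the warping/displacement operations (which require the spatial-transform machinery) are omitted.

```python
import math
import torch

def time_mask(x, frac=0.10):
    # Zero out a random contiguous 10% of the signal (not optimized).
    T = x.shape[-1]
    w = int(frac * T)
    start = torch.randint(0, T - w + 1, (1,)).item()
    out = x.clone()
    out[..., start:start + w] = 0.0
    return out

def gaussian_noise(x, s):
    # eps = 0.25 * sigma * sigmoid(s) * N(0, 1), per-lead sigma.
    sigma = x.std(dim=-1, keepdim=True)
    return x + 0.25 * sigma * torch.sigmoid(s) * torch.randn_like(x)

def baseline_wander(x, s):
    # A sin(f t + phi) with A = 0.25 * sigmoid(s) * U(0,1).
    T = x.shape[-1]
    t = torch.arange(T, dtype=x.dtype)
    A = 0.25 * torch.sigmoid(s) * torch.rand(1)
    f = (20.0 * torch.rand(1) + 10.0) / 60.0  # ~breaths per minute / 60
    phase = 2 * math.pi * torch.rand(1)
    return x + A * torch.sin(f * t + phase)

def magnitude_scale(x, s):
    # Scale by sigmoid(s) * Uniform(0.75, 1.25).
    return x * torch.sigmoid(s) * (0.75 + 0.5 * torch.rand(1))
```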
Appendix B. Dataset Details

We provide more details about the three datasets.

B.1. Dataset A

The labels for RVH and AFib were assigned to each example based on whether relevant diagnostic statements were present in either a clinician's read of the ECG or a machine read of the ECG. For RVH, there were six diagnostic statements that led to a positive label being assigned: "right ventricular hypertrophy", "biventricular hypertrophy", "combined ventricular hypertrophy", "right ventricular enlargement", "rightventricular hypertrophy", "biventriclar hypertrophy". For AFib, there were nine such statements: "atrial fibrillation with rapid ventricular response", "atrial fibrillation with moderate ventricular response", "fibrillation/flutter", "atrial fibrillation with controlled ventricular response", "afib", "atrial fib", "afibrillation", "atrial fibrillation", "atrialfibrillation".

Preprocessing. ECGs were sampled at 250 Hz for 10 seconds, resulting in a 2500 × 12 tensor (all 12 leads) per ECG. We normalized the signals by dividing by 1000; other forms of normalization for this dataset (e.g., z-scoring) resulted in some abnormally large/small values.

B.2. Dataset B

The four labels are obtained by aggregating relevant sets of diagnostic statements; we refer the reader to the PTB-XL paper (Wagner et al., 2020) for further details. Of relevance here is that certain labels, such as MI, contain a small number of distinct diagnostic statements (three), potentially suggesting why many augmentation strategies help: it is a fine-grained task. Others (such as CD) are much broader, covering many more diagnostic statements.

Preprocessing. ECGs in this dataset are sampled at 500 Hz for 10 seconds; we downsample them by a factor of 2 for consistency with Datasets A and C, resulting in a 2500 × 12 tensor per ECG. Normalization involved z-scoring, following the code provided with the dataset.

B.3. Dataset C

The hemodynamics prediction cohort consists of patients who had an ECG and a right heart catheterization procedure on the same day. The catheterization procedure measures hemodynamic variables including the pulmonary capillary wedge pressure (PCWP) and cardiac output (CO), which are used to form the prediction targets. We consider inferring abnormally low cardiac output (less than 2.5 L/min) and abnormally high pulmonary capillary wedge pressure (greater than 20 mmHg).

Preprocessing. ECGs were sampled at 250 Hz for 10 seconds, resulting in a 2500 × 12 tensor per ECG. We normalized the signals by dividing by 1000; other forms of normalization for this dataset (e.g., z-scoring) resulted in some abnormally large/small values, so we opted for the division-based normalization.

Appendix C. Experiments

In this section, we provide further experimental details. We first provide implementation details, and then outline additional experimental results, including: AUPRC results in the low-sample (N = 1000) regime; performance on Datasets A and B in higher-sample regimes; performance on Dataset B in an additional low-sample regime (N = 500 data points); interpretation of the TaskAug policy for RVH; a study of the impact of optimizing policy parameters across different sample-size regimes; and a study of the impact of class-specific magnitudes across different sample-size regimes.

C.1. Implementation details

Network architecture. In all experiments, we use a 1D CNN based on a ResNet-18 (He et al., 2016) architecture. This model has convolutions with a kernel size of 15 and stride 2 (informed by the temporal window we want the convolutions to operate over). The blocks in the ResNet architecture have convolutional layers with 32, 64, 128, and 256 channels respectively. The output after the final block is average pooled in the temporal dimension, and a linear layer is then applied to predict the probability of the positive class.
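A minimal sketch of this kind of architecture is shown below, assuming a (batch, 12, 2500) input. It is illustrative rather than the authors' model: the exact block count, downsampling pattern, and other ResNet-18 details (He et al., 2016) may differ in their code.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """Basic 1D residual block (sketch; downsampling details may differ)."""
    def __init__(self, in_ch, out_ch, k=15, stride=2):
        super().__init__()
        pad = k // 2
        self.conv1 = nn.Conv1d(in_ch, out_ch, k, stride=stride, padding=pad)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, k, stride=1, padding=pad)
        self.bn2 = nn.BatchNorm1d(out_ch)
        self.skip = nn.Conv1d(in_ch, out_ch, 1, stride=stride)  # match shapes
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return self.relu(h + self.skip(x))

class ECGResNet(nn.Module):
    def __init__(self, n_leads=12, chans=(32, 64, 128, 256)):
        super().__init__()
        blocks, in_ch = [], n_leads
        for c in chans:
            blocks.append(ResBlock1D(in_ch, c))
            in_ch = c
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Linear(chans[-1], 1)

    def forward(self, x):       # x: (batch, 12, 2500)
        h = self.blocks(x)
        h = h.mean(dim=-1)      # average pool over the temporal dimension
        return self.head(h)     # logit; apply sigmoid for P(positive class)

model = ECGResNet()
logits = model(torch.randn(4, 12, 2500))
```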
            RVH           AFib
NoAugs      7.4 ± 1.3     21.2 ± 2.0
TaskAug     10.8 ± 0.8*   27.3 ± 1.8*
SMOTE       9.7 ± 1.2     21.0 ± 2.2
DGW         7.1 ± 0.9     19.4 ± 2.3
SpecAug     10.6 ± 1.2    21.1 ± 2.0
TimeMask    10.1 ± 1.5    20.3 ± 2.3

Table 4: Mean and standard error of AUPRC for various data augmentation strategies when detecting cardiac abnormalities on Dataset A. We consider a low-sample regime with a development set of 1000 data points. The best-performing method is bolded, the second best is underlined, and * indicates a statistically significant improvement at the p < 0.05 level. TaskAug is the only method to obtain significant improvements in performance on both tasks.

Optimization settings. As discussed, we used Adam with a learning rate of 1e-3 for all methods, since this resulted in stable training across all settings. When optimizing the TaskAug policy parameters, we used RMSprop with a learning rate of 1e-2, following Lorraine et al. (2020).

Computational information. All models and training were implemented in PyTorch and run on a single NVIDIA V100 GPU.

C.2. Additional results

AUPRC results at 1000 samples. As discussed in Section 5.2, the improvements in AUROC are not always statistically significant. Given that some of the labels have very low prevalence (RVH: 1%, AFib: 5%, low CO: 4%), we also evaluate AUPRC in the low-sample regime, which provides additional information about model performance. Results are shown in Tables 4, 5, and 6. We observe that for the low-prevalence RVH, AFib, and low CO tasks, TaskAug obtains statistically significant improvements in performance; on the Dataset A tasks (RVH and AFib), it is the only method to do so.

            MI          HYP         STTC        CD
NoAugs      59.2±2.1    53.1±1.7    66.9±2.5    67.3±1.1
TaskAug     63.1±1.7    55.2±0.9    68.7±1.3    66.8±1.2
SMOTE       62.0±1.6    41.2±2.9    65.9±1.0    62.7±1.1
DGW         61.1±1.2    53.9±1.6    67.9±1.1    64.7±2.6
SpecAug     61.7±1.6    54.5±1.5    68.8±1.5    65.8±1.4
TimeMask    60.3±1.3    52.8±1.8    68.8±1.2    70.1±1.3

Table 5: Mean and standard error of AUPRC for various data augmentation strategies when detecting cardiac abnormalities on Dataset B. We consider a low-sample regime with a development set of 1000 data points. The best-performing method is bolded, the second best is underlined, and * indicates a statistically significant improvement at the p < 0.05 level.

            Low CO       High PCWP: N = 1000    High PCWP: All Data
NoAugs      7.2 ± 0.4    42.5 ± 0.8             49.7 ± 0.8
TaskAug     8.8 ± 0.6*   43.5 ± 0.9             50.8 ± 0.8
SMOTE       8.8 ± 0.6*   41.9 ± 0.7             46.9 ± 0.7
DGW         8.1 ± 0.7    41.2 ± 0.7             49.7 ± 1.0
SpecAug     7.8 ± 0.4    42.3 ± 1.1             50.3 ± 0.8
TimeMask    8.0 ± 0.5    42.4 ± 0.7             50.1 ± 0.9

Table 6: Mean and standard error of AUPRC for various data augmentation strategies for the hemodynamics inference tasks in Dataset C. We consider a low-sample regime with a development set of 1000 data points. The best-performing method is bolded, the second best is underlined, and * indicates a statistically significant improvement at the p < 0.05 level. TaskAug is one of only two methods to obtain significant improvements in performance on the low CO detection task.
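For reference, AUPRC as reported here can be estimated as the average precision of the model's scores; the following is a small sketch with placeholder labels and scores (not values from the experiments):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Illustrative low-prevalence binary labels and model scores.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.1, 0.3, 0.7, 0.2, 0.6, 0.4, 0.1, 0.2])

# Average precision is a standard estimator of the area under the
# precision-recall curve (AUPRC).
print(average_precision_score(y_true, y_score))
```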
Results at higher sample regimes. Tables 7-10 show AUROC for the different augmentation methods on the tasks from Datasets A and B. We observe that augmentations are less effective at higher sample counts. In particular, when the development set sizes are 2500 and 5000 data points, the improvement from using augmentations (over the NoAugs baseline) with any of the methods is quite small, and nearly always less than 1% AUROC. This suggests that, in general, augmentations are less useful in these higher-data regimes.

            RVH         AFib
NoAugs      86.1 ± 0.9  89.0 ± 0.4
TaskAug     86.9 ± 0.9  89.1 ± 0.4
SMOTE       85.5 ± 1.3  89.1 ± 0.5
DGW         84.8 ± 1.3  88.4 ± 0.5
SpecAug     83.3 ± 1.8  89.1 ± 0.3
TimeMask    85.8 ± 1.1  88.2 ± 0.4

Table 7: Mean and standard error of AUROC for augmentation methods on Dataset A tasks with a development set of 2500 data points. The best-performing method is bolded, and the second best is underlined.

            RVH         AFib
NoAugs      90.6 ± 0.6  92.6 ± 0.2
TaskAug     90.6 ± 0.4  92.8 ± 0.1
SMOTE       89.8 ± 0.6  92.6 ± 0.2
DGW         90.8 ± 0.5  92.5 ± 0.2
SpecAug     90.5 ± 0.8  92.7 ± 0.1
TimeMask    89.4 ± 0.7  92.6 ± 0.2

Table 8: Mean and standard error of AUROC for augmentation methods on Dataset A tasks with a development set of 5000 data points. The best-performing method is bolded, and the second best is underlined.

            MI          HYP         STTC        CD
NoAugs      84.5±0.5    86.4±0.4    89.7±0.3    85.8±0.3
TaskAug     86.1±0.5    86.2±0.4    89.7±0.3    86.6±0.4
SMOTE       84.7±0.7    81.9±1.3    88.7±0.4    85.5±0.6
DGW         84.1±0.5    85.9±0.6    89.5±0.3    86.2±0.3
SpecAug     84.6±0.8    86.2±0.6    90.2±0.3    86.8±0.6
TimeMask    85.7±0.4    86.6±0.3    90.1±0.1    87.0±0.7

Table 9: Mean and standard error of AUROC for augmentation methods on Dataset B tasks with a development set of 2500 data points. The best-performing method is bolded, and the second best is underlined.

            MI          HYP         STTC        CD
NoAugs      89.4±0.3    88.2±0.2    91.0±0.3    89.3±0.4
TaskAug     89.4±0.3    88.3±0.2    91.6±0.2    90.0±0.2
SMOTE       86.6±0.7    86.7±0.4    90.6±0.3    88.0±0.3
DGW         88.6±0.3    88.0±0.2    91.3±0.1    89.3±0.2
SpecAug     89.5±0.2    88.4±0.4    91.6±0.2    89.9±0.2
TimeMask    89.3±0.3    88.6±0.2    91.6±0.2    89.8±0.2

Table 10: Mean and standard error of AUROC for augmentation methods on Dataset B tasks with a development set of 5000 data points. The best-performing method is bolded, and the second best is underlined.

Results on Dataset B at N = 500. Table 11 shows AUROC for the different augmentation methods in an additional low-sample regime, with N = 500. The maximum improvement over the NoAugs baseline by any augmentation strategy is greater in this regime than it was at N = 1000 (see Table 2). Given that the prevalence of these tasks is relatively high, we see more significant performance improvements in the N = 500 regime.

            MI          HYP         STTC        CD
NoAugs      74.4 ± 0.9  81.9 ± 0.8  85.2 ± 0.5  78.9 ± 1.2
TaskAug     78.4 ± 0.5  81.5 ± 1.2  86.2 ± 0.4  80.7 ± 0.6
SMOTE       75.7 ± 1.2  79.2 ± 1.5  85.5 ± 0.3  78.6 ± 1.5
DGW         78.2 ± 0.6  78.7 ± 1.2  82.0 ± 1.3  79.0 ± 0.9
SpecAug     77.8 ± 0.7  81.0 ± 0.6  86.3 ± 0.4  79.3 ± 1.1
TimeMask    77.8 ± 1.0  80.9 ± 1.3  86.6 ± 0.5  80.3 ± 0.8

Table 11: Mean and standard error of AUROC for augmentation methods on Dataset B tasks with a development set of 500 data points. The best-performing method is bolded, and the second best is underlined.

Interpreting the RVH policy. We visualize the TaskAug policy for RVH in Figure 12. We observe high probability assigned to selecting the two temporal operations in Stage 1, namely masking and displacement. Relative magnitudes of different portions of the ECG affect the RVH label, so it is sensible that temporal operations have a higher probability of selection: they are more likely to be label-preserving than operations that change the relative magnitudes of different parts of the ECG. Examining the learned strengths for the displacement operation in Stage 1 (Figure 12(b)), we see little differentiation on a per-class basis. This is sensible, since we do not expect displacement of the signal in time to affect the RVH label differently for the positive and negative classes.
[Figure 12: the learned TaskAug policy for RVH, in two panels: (a) operation selection probabilities (warp, wander, noise, mask, disp, scale, no-op) in Stages 1 and 2; (b) Stage 1 displacement strengths for Class 0 (No RVH) and Class 1 (RVH).]

Figure 12: TaskAug policy for detecting Right Ventricular Hypertrophy. The learned TaskAug policy: the probability of selecting each transformation in both augmentation stages, and the optimized displacement strengths in the first stage. We show the mean/standard error of the learned parameter values over 15 runs. Temporal operations (masking and displacement) have a high probability of selection in Stage 1, which is sensible since these operations are likely to be label-preserving (RVH is typically detected based on the relative magnitudes of portions of beats in the ECG). Both the positive and negative classes have similar optimized displacement augmentation strengths; since we do not expect displacement to impact the class label differently for the two classes, this is also sensible.

Further study on the impact of optimizing augmentations. As shown in the main text (Figure 5), optimizing the policy parameters improves performance over keeping them fixed at their initial values. In Figure 13, we study this effect across different dataset sizes and find that the optimization has the most impact in the low-sample regime, but still yields improvements at higher sample counts. This could be because augmentations boost performance less at higher sample counts in general, so the specific parameter settings in TaskAug also have less impact.

Further study on the impact of class-specific magnitudes. As shown in the main text (Figure 6), optimizing class-specific magnitudes improves over learning a single magnitude parameter shared across classes. Figure 14 studies this effect across different dataset sizes: the class-specific parameters improve performance at all dataset sizes, but the improvement is most pronounced at low sample counts. As with the optimization of augmentation parameters, this could be because augmentations boost performance less at higher sample counts in general, so the class-specific parameterization in TaskAug has less impact.

[Figure 13: AUROC vs. number of data points (500-5000) for AFib and MI, comparing initial (Init Augs) and optimized (Optimized Augs) policies.]

Figure 13: Studying performance when we do not optimize the policy parameters in TaskAug. We show the mean/standard error of AUROC over 15 runs for AFib and over 5 runs for MI. Optimizing the policy parameters results in noticeable improvements in performance over keeping the policy parameters at their initial values (InitAugs). However, the impact of optimizing the parameters is reduced at larger dataset sizes, possibly because augmentations are inherently less useful in higher-sample regimes.

[Figure 14: AUROC vs. number of data points (500-5000) for AFib and MI, comparing shared (Not Class-Specific Mags) and class-specific (Class-Specific Mags) magnitude parameters.]

Figure 14: Studying performance when we do not have class-specific magnitude parameters in TaskAug.
We show the mean/standard error of AUROC over 15 runs for AFib and over 5 runs for MI. Class-specific magnitude parameters improve performance most in the low-sample regime. At higher sample counts, this impact is reduced, possibly because augmentations are inherently less useful in higher-sample regimes.
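For concreteness, the class-specific parameterization studied in Figure 14 corresponds to the strength rule s = yμ1 + (1 − y)μ0 from the worked example earlier in this appendix; the following is a toy sketch (parameter values illustrative, mirroring that example):

```python
import torch

# One learnable strength per class; gradients flow to both parameters.
mu0 = torch.tensor(0.2, requires_grad=True)  # strength for class 0
mu1 = torch.tensor(0.1, requires_grad=True)  # strength for class 1

def strength(y):
    # y is a 0/1 label: s = y * mu1 + (1 - y) * mu0
    return y * mu1 + (1 - y) * mu0
```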