ECG Segmentation by Neural Networks: Errors and Correction∗

Iana Sereda,1 Sergey Alekseev,1 Aleksandra Koneva,1 Roman Kataev,1 and Grigory Osipov1
1 Department of Control Theory, Nizhny Novgorod State University, Gagarin Av. 23, Nizhny Novgorod, 603950, Russia
Abstract

In this study we examine how error correction occurs in an ensemble of deep convolutional networks trained for an important applied problem: segmentation of electrocardiograms (ECG). We also explore the possibility of using information about ensemble errors to evaluate the quality of the data representation built by the network. This possibility arises from the outlier distillation effect demonstrated for the ensemble described in this paper.
Keywords: convolutional neural networks, cardiac cycle, segmentation, ensemble, outliers, errors
arXiv:1812.10386v1 [cs.LG] 26 Dec 2018

I. INTRODUCTION

Correction of errors of Artificial Intelligence (AI) systems is recognized as one of the main problems of the AI-based technical revolution [1]. The effect of error correction often appears in ensembles of neural networks: it is known that, in most cases, an ensemble can improve the effectiveness of the base network [2]. The creation of an ensemble of models is widely used in modern machine learning as the last step of the working pipeline. However, it is difficult to predict which mistakes of the base model the ensemble can eliminate and which it cannot.

The problem of possible mistakes of the trained model remains relevant because the representation of the data learned by a neural network is difficult to interpret [3]. The reliability of a neural network is directly connected to the quality of the internal data representation it has built. In the context of medical tasks, the problem of analyzing the quality of the representation (and fixing the flaws in it) is especially important due to the peculiarities of medical datasets: different pathologies are often represented by a small number of samples, while variants close to the norm may occur too often [4]. However, it is precisely the pathological cases that are most important.

Imbalance of the data set often leads to a situation where formal quality metrics give unreasonably good results, even though the network has failed to cover all important aspects of the data. It is not always possible to combat data imbalance with well-known methods (such as oversampling), because it is not always clear which particular classes require balancing. We illustrate this problem using the example of the ECG markup task: all meaningful components of the cardiac cycle (the P-wave, the T-wave and the QRS complex) are roughly balanced in most ECGs due to the periodicity of the ECG structure. But the task itself contains imbalance, because the dataset is not balanced with respect to diseases. Diseases change the morphology of the components of the cardiac cycle in different ways, so the representation built by the neural network must contain information about how the cardiac cycle looks for every pathology present in the dataset.

When creating a quality metric for an arbitrary task, it is difficult to take into account the imbalance over all the hidden factors of influence existing in that task. In this paper, we use data on how exactly the ensemble corrects the errors of the underlying network in order to draw conclusions about the quality of the data representation obtained by the network.

One way to investigate the reliability of the data representation obtained within a network is to use adversarial examples [5]. Another common approach to analyzing the quality of the representations of deep networks is based on visualization of the learned attributes at different levels [6]. For models with attention, attention visualization can be used [7]. A new direction is the search for a metric evaluating the degree of disentanglement in representations [8]. Other methods can be found in the survey [3].

This paper is organized as follows. First, we describe the details of training a convolutional network on an ECG segmentation dataset. Results are averaged over multiple runs and are measured by means of the usual metrics adopted for quality measurement in this medical task (see section VI). In the last sections we investigate error correction by an iteratively formed network ensemble and demonstrate the effect of distillation of those pathological cases to which a network of the given architecture is most vulnerable on the given data set.

The project’s code is publicly available at: https://github.com/Namenaro/ecg_segmentation

∗ The study was supported by the Ministry of Education and Science of Russia (Project No. 14.Y26.31.0022).

II. DATASET

LUDB [9] is an open-access dataset containing ECG recordings of 200 unique patients. Each recording is represented by a 10-second signal registered from twelve leads with a sampling rate of 500 Hz. An expert annotation is provided for each patient, marking the three segments of the cardiac cycle: P, QRS and T. Proper detection of these waves/complexes is essential for ECG-based diagnostics of the cardiovascular system. A schematic representation of the cardiac cycle is shown in fig. 1. A significant part of the dataset is represented by healthy cases, and the remaining part covers a wide range of different pathologies of the cardiovascular system (different heart rhythm types, conduction abnormalities, repolarization abnormalities and so on). The dataset also contains ECGs with varying degrees of noise.

FIG. 1: Schematic cardiac cycle

FIG. 3: An 8-layer convolutional architecture for which this error analysis was performed
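As a toy illustration of how such expert annotations map onto segmentation targets, the begin/end points of each wave can be rendered into per-sample one-hot masks. The input structure below is assumed for illustration only; it is not LUDB’s actual file format.

```python
import numpy as np

FS = 500                     # sampling rate, Hz
SIGNAL_LEN = 10 * FS         # 10-second recording -> 5000 samples
CHANNELS = {"p": 0, "qrs": 1, "t": 2}  # channel 3 is background


def waves_to_mask(waves, signal_len=SIGNAL_LEN):
    """Convert hypothetical (wave_type, begin, end) annotations, given in
    samples, to a one-hot per-sample target of shape (signal_len, 4)."""
    mask = np.zeros((signal_len, 4), dtype=np.float32)
    for wave_type, begin, end in waves:
        mask[begin:end, CHANNELS[wave_type]] = 1.0
    # everything not covered by P/QRS/T is background (4th channel)
    mask[:, 3] = 1.0 - mask[:, :3].max(axis=1)
    return mask


waves = [("p", 100, 160), ("qrs", 180, 230), ("t", 300, 420)]
target = waves_to_mask(waves)
print(target.sum(axis=0))  # per-channel sample counts
```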
III. DATA PREPARATION

ECG preprocessing consisted of baseline wander (BW) removal, which is a conventional first step in ECG processing for most applications [10]. Baseline wander is a low-frequency ECG artifact, which may be caused by the patient’s breathing or movement [11] and holds no diagnostic information. The ECG was filtered with two median filters as described in [10]. The resulting ECG is shown in fig. 2. High-frequency noise was not removed, and no augmentation was performed.

FIG. 2: BW removal example. The ECG signal before and after processing is shown at the top and bottom respectively

IV. BASE NEURAL NETWORK

The architecture of the base neural network is shown in fig. 3. Convolutional neural network architectures have been shown to be well suited to many real-world problems, including medical signal processing and analysis, and ECG signal annotation in particular [12].

The output signal is comprised of four channels: the first contains the annotation mask for the P wave, the second for the QRS complex, and the third for the T wave. The fourth channel is auxiliary and is equivalent to the background class commonly found in segmentation problems. Examples of input signals and their corresponding outputs are shown in figs. 6 and 7. A softmax function is applied to each pixel of the annotation at the last, fully connected layer of the network. This function acts across the channels and therefore encodes the a priori assumption that activity on one output channel must tend to exclude activity on the other channels. The end result is a set of binary masks, one per wave type. For each channel, the mask is generated individually from the output signal (fig. 6) as follows: at each point in time, a “winner” channel is selected (the one with the largest value at that time point) and is assigned 1, while the remaining channels are assigned 0.

V. TRAINING

Training was conducted on one ECG lead, because experiments showed that adding the remaining ECG leads brings no statistically significant improvement to the results of the neural network. The data set was split into training (134 patients) and test (66 patients) parts. No post-processing was performed on the resulting segmentation. During training, mini-batches are formed from randomly selected 6-second intervals. The remaining training parameters are: the optimizer is RMSProp and the loss function is categorical cross-entropy.

VI. EVALUATION

In this work, we define annotation as the designation of the beginning and end points of each of the specified waves/complexes. To evaluate the annotation quality for a particular type of point (such as P-wave starting points), we employ an algorithm that works as follows: for each point of this type in the doctor’s annotation, the algorithm looks for the corresponding point in the network’s annotation. If a corresponding point is found within the specified neighborhood of the doctor’s point, we count the network’s decision as valid (true positive, TP). In this case the error value is calculated as the distance between the point in the doctor’s annotation and the corresponding point in the network’s annotation.
If a point specified by the network does not exist in the doctor’s annotation within the specified neighbourhood, we count the answer as a false positive (FP). Should the network fail to locate a point which is present in the doctor’s annotation, the answer is marked as a false negative (FN).

The permitted neighbourhood is calculated adaptively depending on the patient’s heart rate: for a heart rate of 70 BPM, the radius of the permitted neighbourhood was chosen to be 150 ms. The size of this neighbourhood is then scaled linearly with the length of the cardiac cycle. The 150 ms interval was selected in accordance with ANSI/AAMI EC57:1998.

The following quality metrics are commonly used for ECG annotation evaluation:

• m – expected value of the error

• σ² – error variance

• Se = TP / (TP + FN) – sensitivity

• PPV = TP / (TP + FP) – positive predictive value

Values of these metrics for the base network are given in table I. The performance of the base network is comparable to the quality of direct methods [9] and may seem relatively good, but in the following sections we demonstrate that even though these formal metric values appear promising, they are not trustworthy. They do not account for the degree to which various pathologies are represented in the dataset on which ECG segmentation is performed. For example, if the majority of samples belong to healthy patients, and the segmentation algorithm handles the standard healthy case correctly but not the pathological one, then its quality assessment will depend directly on the number of pathological or artifact-containing samples in the data set. It is therefore important to investigate the behavior of the network on pathological cases rather than relying on formal metrics.

VII. MAIN NETWORK TRENDS

In this section we describe some properties that the base deep network demonstrates after being trained on the training dataset.

A. Noise stability

All else being equal, the presence of high-frequency noise neither interferes with the network’s ability to produce a correct annotation nor influences the smoothness of its output signal.

FIG. 4: A noisy ECG is annotated correctly. The bottom row of colored boxes shows the annotation given by an expert; the top row depicts the annotation by the neural network, whose architecture is shown in fig. 3

The absence of a need for noise filtration is an advantage of the deep learning approach, since it reduces the time spent on ECG preprocessing.

B. Reaction to pathologies

It turned out that the presence of pathology has the most noticeable effect on the quality of the neural network’s performance. When analyzing the network’s performance on a test sample of patients, the following rule of thumb emerged: if a case is pathological, it can still be properly annotated by a common neural network (for example, fig. 5 depicts an ECG with a non-standard T-wave shape which the network annotated well). But if an ECG was erroneously annotated by the neural network, then this ECG is pathological (or unreadable due to artifacts of the recording process).

FIG. 5: This graph shows an abnormal case (containing unusual T-waves) being annotated by the base network with satisfactory results.

This is especially true for the QRS complex, which turned out to be the easiest component to annotate, both for the neural network and for direct algorithms. Typically it has the largest signal amplitude, although this is not always true; see, for example, fig. 4.

If we study the raw output of the network for an ECG which does not contain any noticeable pathologies and compare it against the output for a markedly pathological ECG, we notice a systematic difference. In the pathological case, the network’s signal contains numerous asymmetric low-amplitude spikes which do not contribute to the resulting annotation (see fig. 7). We do not observe this kind of behavior in the non-pathological case: should a channel contain a signal spike, it is smooth and has a large enough amplitude to influence the annotation (fig. 6). In a sense, the intensity of this effect can be interpreted as the network’s "confidence".
TABLE I: Quality metrics for the base network. Values are averaged across 20 networks.

Value       P begin   P end      QRS begin  QRS end    T begin   T end
Se (%)      95.20     95.39      99.51      99.50      97.95     97.56
PPV (%)     82.66     82.59      98.17      97.96      94.81     94.96
m ± σ (ms)  2.7±21.9  −7.4±28.6  2.6±12.4   −1.7±14.1  8.4±28.2  −3.1±28.2

FIG. 6: A simple case of ECG annotation. On the top graph, the bottom set of colored markings shows the doctor’s manual annotation for P waves (red), QRS complexes (green) and T waves (blue); the set of colored markings above it represents the network’s annotation for the same ECG. The bottom graph shows the raw output signal of the network. These values represent the network’s “confidence” that the current segment contains the corresponding ECG waveform. For this simple case of a healthy ECG, the network performs well when compared against a professional annotation done by a cardiologist. Smooth symmetric waves in the output signal of the network are characteristic of the segmentation of simple cases (like this one)

FIG. 7: Distinctive features of the network’s output signal in a markedly pathological case. The top graph contains annotations for QRS complexes and T waves in green and blue respectively; the bottom set of markings represents the doctor’s annotation, with the network’s annotation located above it. The P wave (red) is absent entirely, a fact noted by both the cardiologist and the network. Non-smooth asymmetric waves of arbitrary amplitude in the output signal of the network (shown by black arrows) are characteristic of the segmentation of an electrocardiogram with severe pathology (like in this case)
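The winner-takes-all rule of section IV, which converts the raw four-channel output into the binary masks discussed above, can be sketched as follows; `raw` stands for a hypothetical softmax output of the network with shape (time, 4).

```python
import numpy as np

def raw_to_masks(raw):
    """At every time step the channel with the largest raw output
    'wins' and is assigned 1; the remaining channels get 0."""
    winners = raw.argmax(axis=1)                 # winner channel per sample
    masks = np.zeros_like(raw, dtype=np.int8)
    masks[np.arange(raw.shape[0]), winners] = 1  # one-hot the winner
    return masks

raw = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.2, 0.5, 0.2, 0.1],
                [0.1, 0.1, 0.2, 0.6]])
masks = raw_to_masks(raw)
```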

VIII. ENSEMBLE

The ensemble formation procedure was designed in such a way that adding a new network to the ensemble fixes some errors of the already existing networks on the training set.

After training, the F-score of the base network was measured on each patient of the training sample, so that one could see in which cases the network does not perform well. Then the iterative ensemble-building procedure starts. All patients rated at an F-score of 0.99 or above were removed from the training set. The rest of the training set is then used as a separate training set for a new neural network.

This new neural network is trained on that set and, again, the screening procedure is carried out: all cases on which this network failed to achieve a score of 0.99 remain, while the others are deleted. The procedure is repeated until the training set runs out of patients.

At each iteration of this algorithm a new neural network is created, and each of these networks is trained on an ever-decreasing data set. If, after one step, the sample size has not changed, the same network is re-trained on the same sample, on the assumption that it fell into a bad local minimum. Fig. 8 demonstrates how the size of the training set changed during the procedure described.

When the ensemble is created, the resulting annotation for every input ECG is obtained by averaging the raw output signals across all members of the ensemble.

IX. ERROR CORRECTION IN ACTION

From the ensemble construction algorithm itself, one can see that the ensemble is able to systematically correct some errors of the base network. Members of the ensemble can also correct each other’s mistakes. To illustrate this, we provide a couple of typical examples: fig. 9 demonstrates how the 4th member of the ensemble fixes an error of the 3rd network on an abnormal ECG, and fig. 10 depicts how the ensemble itself fixes an error of the base network for an abnormal case (the patient was taken from the test set).

Networks added to the ensemble at later iterations were trained on very small patient subsets (for example, fig. 8 clearly shows that at least half of the networks
were trained on fewer than 10 patients). Every network has 60,804 trainable parameters. This situation obliges us to check the degree of overfitting in such networks. To do so, we designed a simple procedure which roughly evaluates the degree of overfitting of all but one of the networks in an ensemble without the need to use the test sample. The key is that every member of the ensemble (except the first one) is trained on only a subset of the training set of patients. This allows them to use the unseen part of the training set as their test set and thereby evaluate their generalization ability.

Surprisingly, experimental results show that there are ensemble members that can produce good annotations (F-score higher than 0.99) for tens of unseen patients despite having been trained on 2-3 patients. Fig. 11 illustrates the behavior of an ensemble of 12 members: for every member of the ensemble, the dark green bar shows the number of well-annotated patients from the training set of that member, and the light green bar shows the number of well-annotated patients from the unseen part of the whole training set. We can clearly see that there are only two networks which are probably significantly overfitted (the first network does not have its own test set, so we cannot make any judgments about it).

FIG. 8: Number of patients remaining in the training subset at every stage of ensemble formation. The formation continues until the number of patients in the subset reaches zero.

FIG. 9: Noticeable improvement of annotation quality through consecutive stages of ensemble training. The doctor’s annotation is shown at the bottom, the 4th network’s annotation is located in the middle and the 3rd network’s annotation is at the top. The 3rd network mistakenly marks a P-wave (blue). The remaining components of the cycle (green and red) are identified correctly by both networks.

FIG. 10: Annotation of an abnormal case. The doctor’s annotation is shown at the bottom, a single network’s annotation is located in the middle and the ensemble’s annotation is at the top. The improvement in segmentation quality brought by the ensemble is clearly visible.

FIG. 11: The generalization capability of every individual member of the ensemble. Each member’s ability to annotate its own part of the dataset exceptionally well (i.e. with an F-score of at least 0.99) is shown in dark green, while its ability to annotate the previously unseen part of the dataset exceptionally well is shown in light green.

To get an idea of the behavior of an ensemble built according to the principle described above, we visualized its behavior on the test and training sets and compared it to the behavior of the underlying network. The resulting visualization is shown in fig. 12. The left subfigures demonstrate how the base network annotates the training and test sets, while the right subfigures depict how the ensemble performs on them. The answers of the ensemble are concentrated in a very narrow area of high F-scores near the 0.99 value. The shape of the point clouds is the same for both the test and training sets, which means that the behavior of the ensemble on the training set generalizes.

Another conclusion is that outlier patients still exist even for the ensemble. An example of an outlier ECG is depicted in fig. 13. Moreover, outliers are now better highlighted than before: after the ensemble processing, the data set turns out to be divided into two classes, a superconcentrated cloud and a very rarefied one. Interestingly, although the overall F-score of the ensemble is higher than that of the base network (0.94 for the base network versus more than 0.95 for the ensemble), the proportion of outliers detected by the ensemble is not smaller (and is even slightly higher) than that for the base network.

This "distillation" effect may, in some cases, help to assess which types of errors can be corrected by searching for other local minima of the given model, and which will most likely require other architectures or other training procedures.

FIG. 12: F-score scattergrams for every patient demonstrate the distillation effect. The horizontal axis corresponds to patients, the vertical axis to the F-score of the segmentation given by the ensemble or the base network. Top left: the base network annotates the training set. Top right: the ensemble annotates the training set. Bottom left and bottom right: the annotation of the test part performed by the base network and the ensemble, respectively.

FIG. 13: An example of an outlier ECG on which both the ensemble and the base network fail: all the members have failed to segment the T-wave (green), potentially because the amplitude of the T-wave has become comparable in scale to the amplitude of the noise. Colored blocks indicate the expert’s markup. The top markings show the output signals of the ensemble members.

X. CONCLUSION

The ensemble generation procedure allowed us to consider simultaneously many different local minima of the loss function. The differences between the minima were significant in the sense that the networks were required to give significantly different markup for subsets of ECGs. As expected, this led to an improvement over the results of the base network. Interestingly, however, these improvements are not uniform across the dataset. A substantial group of samples stands out which the base network annotated with minor errors (F-score higher than 0.9) and which, after applying the ensemble, became annotated nearly perfectly (F-score close to 1). There is also a group of samples which the base network annotated with (usually significant) errors and for which the ensemble does not improve the situation (moreover, it often worsens the segmentation of the base network in such cases).

This leads to the assumption that the cases from the second set are fundamentally difficult for the considered neural network with the adopted training procedure. Surprisingly, the presence of pathology on an ECG does not guarantee that this ECG will fall into the second set. It depends on the type of pathology, which means that among the pathologies there are "simple" and "complex" ones.

XI. DISCUSSION

The number of samples from the aforementioned second class can be used as a basis for assessing the quality of the data representation that a network of a given architecture can build. In the case considered, for example, it can be concluded that the data representation built by the base network has serious flaws, despite the fact that formal quality metrics (such as sensitivity, positive predictive value and F-measure) show relatively high values. A promising area of research could also be the search for a reliable metric that assesses the quality of a network’s representation without treating the network as a black box. The creation of such metrics probably requires further development of the mathematical theory of deep learning.
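As a compact summary, the iterative ensemble construction of section VIII and the averaging rule used at prediction time can be sketched as follows. The helper functions `train_network`, `f_score_per_patient` and `predict_raw` are assumed stand-ins; this is not the authors’ released code.

```python
import numpy as np

THRESHOLD = 0.99  # patients annotated at or above this F-score are screened out

def build_ensemble(patients, train_network, f_score_per_patient):
    """Iteratively train networks, each on the patients the previous
    networks still fail on, until no patients remain."""
    ensemble, remaining = [], list(patients)
    while remaining:
        net = train_network(remaining)
        scores = f_score_per_patient(net, remaining)
        # keep only the patients this network still fails on; if nobody
        # is screened out, the next iteration re-trains on the same
        # sample (the paper assumes the network hit a bad local minimum,
        # and stochastic training makes the next attempt different)
        remaining = [p for p, s in zip(remaining, scores) if s < THRESHOLD]
        ensemble.append(net)
    return ensemble

def ensemble_predict(ensemble, ecg, predict_raw):
    """The resulting annotation averages raw output signals over members."""
    return np.mean([predict_raw(net, ecg) for net in ensemble], axis=0)
```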

[1] AN Gorban, A Golubkov, B Grechuk, EM Mirkes, and


IY Tyukin, “Correction of ai systems by linear discrim-
inants: Probabilistic foundations,” Information Sciences
466, 303–322 (2018).
[2] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu,
John E Hopcroft, and Kilian Q Weinberger, “Snap-
shot ensembles: Train 1, get m for free,” arXiv preprint
arXiv:1704.00109 (2017).
[3] Quan-shi Zhang and Song-Chun Zhu, “Visual inter-
pretability for deep learning: a survey,” Frontiers of In-
formation Technology & Electronic Engineering 19, 27–39
(2018).
[4] Bee Wah Yap, Khatijahhusna Abd Rani, Hezlin
Aryani Abd Rahman, Simon Fong, Zuraida Khairudin,
and Nik Nik Abdullah, “An application of oversampling,
undersampling, bagging and boosting in handling imbal-
anced datasets,” in Proceedings of the First International
Conference on Advanced Data and Information Engineer-
ing (DaEng-2013) (Springer, 2014) pp. 13–22.
[5] Anurag Arnab, Ondrej Miksik, and PH Torr, “On the ro-
bustness of semantic segmentation models to adversarial
attacks,” arXiv preprint arXiv:1711.09856 2 (2017).
[6] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs,
and Hod Lipson, “Understanding neural networks through
deep visualization,” arXiv preprint arXiv:1506.06579
(2015).
[7] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho,
Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and
Yoshua Bengio, “Show, attend and tell: Neural image cap-
tion generation with visual attention,” in International
conference on machine learning (2015) pp. 2048–2057.
[8] Cian Eastwood and Christopher KI Williams, “A frame-
work for the quantitative evaluation of disentangled rep-
resentations,” (2018).
[9] Alena I. Kalyakulina, Igor I. Yusipov, Victor A. Moskalenko, Alexander V. Nikolskiy, Artem A. Kozlov, Konstantin A. Kosonogov, Nikolay Yu. Zolotykh, and Mikhail V. Ivanchenko, “Lu electrocardiography
database: a new open-access validation tool for delineation algorithms,” eprint arXiv:1809.03393 (2018).
[10] Philip De Chazal, Maria O’Dwyer, and Richard B Reilly,
“Automatic classification of heartbeats using ecg mor-
phology and heartbeat interval features,” IEEE transac-
tions on biomedical engineering 51, 1196–1206 (2004).
[11] Gustavo Lenis, Nicolas Pilia, Axel Loewe, Walther HW
Schulze, and Olaf Dössel, “Comparison of baseline wan-
der removal techniques considering the preservation of st
changes in the ischemic ecg: a simulation study,” Com-
putational and mathematical methods in medicine 2017
(2017).
[12] Pranav Rajpurkar, Awni Y Hannun, Masoumeh Hagh-
panahi, Codie Bourn, and Andrew Y Ng, “Cardiologist-
level arrhythmia detection with convolutional neural net-
works,” arXiv preprint arXiv:1707.01836 (2017).
