ECG Segmentation by Neural Networks: Errors and Correction
Iana Sereda,1 Sergey Alekseev,1 Aleksandra Koneva,1 Roman Kataev,1 and Grigory Osipov1
1
Department of Control Theory, Nizhny Novgorod State University,
Gagarin Av. 23, Nizhny Novgorod, 603950, Russia
Abstract
In this study we examine how error correction occurs in an ensemble of deep
convolutional networks trained for an important applied problem: segmentation of
electrocardiograms (ECG). We also explore the possibility of using information about
ensemble errors to evaluate the quality of the data representation built by the network.
This possibility arises from the outlier-distillation effect demonstrated for the ensemble described in this paper.
Keywords: convolutional neural networks, cardiac cycle, segmentation, ensemble, outliers, errors
arXiv:1812.10386v1 [cs.LG] 26 Dec 2018
IV. BASE NEURAL NETWORK

The architecture of the base neural network is shown in fig. 3. It was shown that the
convolutional neural network architecture is well-suited for many real-world problems,
such as medical signal processing and analysis, including ECG signal annotation [12].

The output signal is comprised of four channels: the first contains the annotation mask
for the P peak, the second the mask for the QRS complex, and the third the mask for the
T peak. The fourth channel is auxiliary and is equivalent to the background class
commonly found in segmentation problems. Examples of input signals and their
corresponding output signals are shown in pictures 6, 7. A softmax function is applied
across these four channels.

V. TRAINING

In this work, we define annotation as the designation of the beginning and end points of
each of the specified waves/complexes. To evaluate the annotation quality for a
particular type of point (such as the P-wave starting points), we employ an algorithm
that works as follows: for each point of this type in the doctor's annotation, the
algorithm looks for a corresponding point in the network's annotation.

If a corresponding point is found within the specified neighborhood of the doctor's
point, we count the network's decision as valid (True Positive, TP). In this case the
error value is calculated as the distance between the point in the doctor's annotation
and the corresponding point in the network's annotation.
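The point-matching procedure just described can be sketched as follows. The greedy
nearest-point matching and all names here are illustrative assumptions; the paper does
not spell out the exact matching rule:

```python
import numpy as np

def match_points(doctor_points, network_points, tolerance):
    """For each doctor-annotated point, look for a network point of the same
    type within +/- `tolerance` samples.  Returns the signed TP errors and
    the TP/FN counts.  Greedy nearest-point matching is an assumption."""
    errors, tp, fn = [], 0, 0
    candidates = list(network_points)
    for p in doctor_points:
        nearest = min(candidates, key=lambda q: abs(q - p)) if candidates else None
        if nearest is not None and abs(nearest - p) <= tolerance:
            tp += 1
            errors.append(nearest - p)    # distance doctor -> network point
            candidates.remove(nearest)    # each network point matched at most once
        else:
            fn += 1                       # no network point close enough
    return np.asarray(errors), tp, fn
```

Network points left unmatched after this loop would count as false positives (not shown
here), with `tolerance` set to the neighborhood width chosen for the evaluation.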
• σ² – error variance

• Se = TP/(TP + FN) – sensitivity

• PPV = TP/(TP + FP) – positive predictive value

Values of these metrics for the main network can be seen in table I. The performance of
the base network is comparable to the quality of direct methods [9] and may seem
relatively good, but in the following sections we will demonstrate that even though
these formal metrics' values appear promising, they are not trustworthy. They do not
account for the degree to which various pathologies are represented in the dataset on
which ECG segmentation is performed. For example, if the majority of samples belongs to
healthy patients, and if the segmentation algorithm handles the standard healthy case
correctly but not the pathological one, then its quality assessment will directly depend
on the number of pathological or artifact-containing samples in the data set. So it is
important to investigate the behavior of the network on pathological cases rather than
relying on formal metrics.

VII. MAIN NETWORK TRENDS

In this section we describe some qualities that the base deep network demonstrates when
trained on the training dataset.

A. Noise stability

All else being equal, the presence of high-frequency noise does not interfere with the
network's ability to produce a correct annotation, nor does it influence the smoothness
of its output signal.

It turned out that the presence of pathology has the most noticeable effect on the
quality of the neural network's performance. When analyzing the network's performance on
a test sample of patients, the following rule of thumb proved valid: if the case is
pathological, it can still be properly annotated by a common neural network (for
example, fig. 5 depicts an ECG with a non-standard T-wave shape which the network
annotated well). But if an ECG was erroneously annotated by the neural network, then
this ECG is pathological (or unreadable due to artifacts of the recording process).

FIG. 5: This graph shows an abnormal case (containing unusual T-waves) being annotated
by a base network with satisfactory results.

This is especially true for the QRS complex. The QRS complex turned out to be the
easiest to mark up, both for the neural network and for direct algorithms. Typically it
has the largest signal amplitude, although this is not always true; for example, see
fig. 4.

If we study the raw output of the network for an ECG which does not contain any
noticeable pathologies and compare it against the output for a markedly pathological
ECG, we can notice a systematic difference. In the pathological case, the network's
signal contains numerous asymmetric low-amplitude spikes which do not contribute to the
resulting annotation (see pic. 7). However, we do not observe this kind of behavior in
the non-pathological case: should a channel contain a signal spike, it is smooth and has
a large enough amplitude to influence the annotation (pic. 6). In a sense, the intensity
of this effect can be interpreted as the network's "confidence".
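Under the definitions above, the listed metrics can be computed from the matching
results in a few lines; `errors` is assumed to hold the TP distances, and the function
and argument names are ours:

```python
def segmentation_metrics(errors, tp, fp, fn):
    """Compute the three quality metrics listed above: error variance,
    sensitivity (Se) and positive predictive value (PPV)."""
    n = len(errors)
    mean = sum(errors) / n if n else 0.0
    sigma2 = sum((e - mean) ** 2 for e in errors) / n if n else 0.0  # error variance
    se = tp / (tp + fn) if (tp + fn) else 0.0    # Se = TP / (TP + FN)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # PPV = TP / (TP + FP)
    return sigma2, se, ppv
```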
TABLE I: Quality metrics for the base network. Values are averaged across 20 networks.
VIII. ENSEMBLE

The ensemble formation procedure was designed in such a way that adding a new network to
the ensemble fixes some errors of the already existing networks on the training set.

After training, the F-score of the base network was measured on each patient of the
training sample, so that one could see in which cases the network does not perform well.
Then the procedure of iterative ensemble building starts. All patients rated at an
F-score of 0.99 and above were removed from the training set; the rest of the training
set is then used as a separate training set for a new neural network.

This new neural network is trained on that training set and, again, the procedure of
screening out patients is carried out: all the cases on which this neural network has
failed to achieve a score of 0.99 remain, while the others are deleted. The procedure
described is repeated until the patients in the training set run out.

At each iteration of this algorithm, a new neural network is created. Each of these
networks is trained on an ever decreasing data set. If, after one step, the sample size
has not changed, then the same network is re-trained on the same sample, on the
assumption that it fell into a bad local minimum. Figure 9 demonstrates an example of
how the size of the training set changed during the procedure described.

When the ensemble is created, the resulting annotation for every input ECG can be
obtained by averaging the raw output signals across all members of the ensemble.

IX. ERROR CORRECTION IN ACTION

From the ensemble construction algorithm itself, one can see that the ensemble is able
to systematically correct some errors of the base network. Members of the ensemble can
also correct each other's mistakes. To illustrate this, we provide a couple of typical
examples: fig. 9 demonstrates how the 4th member of the ensemble fixes the error of the
3rd network in the case of an abnormal ECG, and fig. 10 depicts how the ensemble itself
fixes an error of the base network for an abnormal case (the patient was taken from the
test set).

Networks added to the ensemble at later iterations were trained on very small patient
subsets (for example, in fig. 8 it is clearly visible that at least half of networks
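The iterative construction and the averaging step can be sketched as follows.
`train_network`, `f_score`, and `predict` are placeholder callables standing in for the
paper's training, scoring, and inference routines; their signatures and this loop's
details are our assumptions:

```python
import numpy as np

def build_ensemble(train_patients, train_network, f_score, threshold=0.99):
    """Sketch of the iterative ensemble construction described above.
    `train_network(patients)` trains and returns a new network on a patient
    subset; `f_score(net, patient)` scores its annotation of one patient."""
    ensemble = []
    remaining = list(train_patients)
    while remaining:
        net = train_network(remaining)
        ensemble.append(net)
        # keep only the patients the newest network still annotates poorly
        remaining = [p for p in remaining if f_score(net, p) < threshold]
        # if `remaining` did not shrink, the next pass re-trains a network
        # on the same sample (a bad local minimum is assumed)
    return ensemble

def ensemble_annotate(ensemble, ecg, predict):
    """Average the raw per-channel outputs across ensemble members and take
    the channel-wise argmax as the final annotation mask."""
    raw = np.mean([predict(net, ecg) for net in ensemble], axis=0)
    return raw.argmax(axis=0)
```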
sets, which means the behavior of the ensemble on the training set can be generalized.

Another conclusion is that outlier patients still exist even for the ensemble. An
example of an outlier ECG is depicted in pic. 13.

FIG. 12: F-score scattergrams for every patient, demonstrating the distillation effect.
The horizontal axis corresponds to the patients, the vertical axis to the F-score of the
segmentation given by an ensemble or by the base network. Top left: the base network
annotates the training set. Top right: the ensemble annotates the training set. Bottom
left and bottom right: the annotation of the test part performed by the base network and
by the ensemble, respectively.

X. CONCLUSION

The ensemble generation procedure allowed us to consider simultaneously many different
local minima of the loss function. The differences between the minima were significant
in the sense that the networks were required to give significantly different markup for
subsets of ECGs. As expected, this has led to an improvement over the results of the
base network. However, these improvements are not uniform across the dataset, which is
interesting. A substantial group of samples stands out which the base network annotated
with minor errors (F-score higher than 0.9) and which, after applying the ensemble,
became annotated close to the ideal (F-score close to 1). There is also a group of
samples which the base network annotated with errors (usually significant) and for which
the ensemble does not improve the situation (moreover, it often worsens the segmentation
of the base network in such cases).

This leads to the assumption that the cases from the second group are fundamentally
difficult for the considered neural network with the adopted training procedure.
Surprisingly, the presence of pathology on the ECG does not guarantee that the ECG will
fall into the second group. It depends on the type of pathology, which means that among
the pathologies there are "simple" and "complex" ones.
XI. DISCUSSION